KV Cache Compression

KV cache compression treats the KV cache as external memory, in a role similar to RAG: external documents are run through the model once per task, offline, and the resulting KV caches are compressed and accumulated. At query time the model reuses these precomputed caches, which makes native embeddings more practical in online settings than with RAG and keeps online warmup to a minimum. Scaling to large document collections may be challenging for global tasks, since the method is less scalable, but it is an innovative approach that can be particularly useful in specific industry domains.
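The offline/online split described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `make_kv` stands in for a real model forward pass, the task name `contracts-qa` is hypothetical, and the top-k scoring in `compress_kv` is a simple placeholder, not the compression method from the linked paper.

```python
import math
import random


def make_kv(num_tokens, dim, seed):
    """Stand-in for one model forward pass: returns toy key and value vectors."""
    rng = random.Random(seed)
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    return keys, values


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress_kv(keys, values, task_query, keep):
    """Keep the `keep` positions whose keys score highest against a task vector.

    Placeholder heuristic (assumption), not the paper's algorithm.
    """
    scores = [dot(k, task_query) for k in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    top.sort()  # preserve the original token order
    return [keys[i] for i in top], [values[i] for i in top]


def attend(query, keys, values):
    """Single-head softmax attention of one query over cached keys and values."""
    scale = math.sqrt(len(query))
    logits = [dot(query, k) / scale for k in keys]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


DIM = 8

# Offline: one pass over the documents per task, then compress and store.
cache_store = {}
task_query = [1.0] * DIM  # hypothetical task representation
doc_keys, doc_values = make_kv(num_tokens=64, dim=DIM, seed=0)
cache_store["contracts-qa"] = compress_kv(doc_keys, doc_values, task_query, keep=16)

# Online: look up the compressed cache and attend over it; no document re-encoding.
cached_keys, cached_values = cache_store["contracts-qa"]
user_query = [0.5] * DIM
output = attend(user_query, cached_keys, cached_values)
```

The design point is that the expensive step (encoding 64 document tokens) happens once offline, while each online query only touches the 16 retained positions.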

https://arxiv.org/pdf/2503.04973
