KV caching is mainly used for Transformer inference. During training, the whole input sequence is processed in a single forward pass (unlike generation, which revisits the same tokens step by step), and computation for all positions is batched, so the benefit of a KV cache is small; memory is also needed for many other things (such as activations, gradients, and optimizer states), so the cache is not used there. Dynamically growing memory would also be risky in that setting.
Storing the Key/Value tensors in GPU memory for reuse often takes a lot of memory; in some cases the cache grows larger than the model weights themselves.
In Hugging Face Transformers, the corresponding `use_cache` option of the `generate` function defaults to `True`.
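As a minimal sketch (the model name, prompt, and timing harness below are illustrative choices, not from this note), the flag can be toggled explicitly to compare generation with and without the cache:

```python
# Minimal sketch: comparing generation with and without the KV cache.
# The model name ("gpt2") and prompt are illustrative assumptions.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("The KV cache stores", return_tensors="pt")

for use_cache in (True, False):
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=64, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```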
KV cache memory usage grows with several factors (see the size estimate after this list):
- Number of decoder layers: more layers mean more KV caches to keep
- Sequence length: longer sequences store more key/value pairs
- Model dimension: larger key and value dimensions mean more data per cached token
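A hedged back-of-the-envelope estimate of how these factors combine: per token the cache holds 2 (keys and values) × layers × heads × head_dim × bytes per element. The configuration below is an illustrative LLaMA-2-7B-like assumption, not a figure from this note:

```python
# Back-of-the-envelope KV cache size; the configuration values are
# illustrative (roughly LLaMA-2-7B-like in fp16).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # The factor 2 accounts for storing both keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # ~2.0 GiB for a single 4096-token sequence
```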
Several techniques address this: FastGen compresses the KV cache differently depending on the attention head type, and PyramidKV keeps more cache in lower layers and less in higher layers. Combining such methods enables memory-optimized Transformer inference.
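A minimal sketch of the pyramid-style idea only (not the actual PyramidKV algorithm; the linear budget schedule and the token-importance scoring are assumptions): each layer keeps only its top-scoring cached tokens, with a budget that shrinks toward higher layers.

```python
# Minimal sketch of a pyramid-style per-layer KV budget. Assumption: a simple
# linear schedule from a large budget at layer 0 to a small one at the top
# layer; the real PyramidKV allocation and scoring rules differ.
import torch

def layer_budgets(num_layers, max_budget, min_budget):
    # More cache in lower layers, less in higher layers.
    return [int(max_budget - (max_budget - min_budget) * l / (num_layers - 1))
            for l in range(num_layers)]

def prune_kv(keys, values, scores, budget):
    # keys/values: [seq, heads, head_dim]; scores: importance per cached token.
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices.sort().values
    return keys[keep], values[keep]

print(layer_budgets(num_layers=8, max_budget=1024, min_budget=128))
```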
Attention KV Caches

KV Cache in production
LLM inference speed of light
In the process of working on calm, a minimal from-scratch fast CUDA implementation of transformer-based language model inference, a critical consideration was establishing the speed of light for the inference process, and measuring the progress relative to that speed of light. In this post we’ll cover this theoretical limit and its implications.
https://zeux.io/2024/03/15/llm-inference-sol
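The core argument of that post is that single-sequence decoding is memory-bandwidth bound: every generated token must read all weights plus the current KV cache from memory. A hedged sketch of the resulting lower bound on per-token latency, using illustrative hardware and model numbers (7B parameters in fp16, ~2 GiB of KV cache, ~3.35 TB/s for an H100-class GPU), not measurements:

```python
# Hedged sketch: per-token latency lower bound for bandwidth-bound decoding.
# time_per_token >= bytes_read / memory_bandwidth; all figures are assumptions.
weight_bytes = 7e9 * 2        # 7B parameters in fp16
cache_bytes = 2 * 2**30       # example KV cache for a long sequence
bandwidth = 3.35e12           # bytes/s, H100-class memory bandwidth

t = (weight_bytes + cache_bytes) / bandwidth
print(f"lower bound: {t * 1e3:.2f} ms/token, i.e. at most {1 / t:.0f} tokens/s")
```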
An Overview of LLM Inference (2) - KV Cache
The previous post covered why Large Language Model (LLM) inference matters and why we should run LLMs efficiently. It also explained that text generation with an LLM is autoregressive generation and introduced the various decoding strategies that can be used during that process. See the previous post: https://dytis.tistory.com/53 (An Overview of LLM Inference (1) - Text Generation with LLMs) …
https://dytis.tistory.com/54
Problems with Large KV Caches under Long Context and Their Solutions: Part I - KV Cache Memory Requirements
An auto-regressive model predicts the output of the next step using the outputs of previous steps. GPT is an auto-regressive model that generates the next token based on previously generated tokens. GPT reuses the intermediate values produced when generating earlier tokens, namely…
https://moon-walker.medium.com/long-context로-인한-large-kv-cache의-문제점과-해결-방안-part-i-kv-cache의-메모리-요구량-025f3d5dea93
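A hedged sketch of that reuse with Hugging Face Transformers: once a cache exists, only the newest token is fed to the model. The model name ("gpt2") and the greedy decoding loop are illustrative assumptions.

```python
# Minimal sketch of incremental decoding that reuses cached key/value tensors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The KV cache lets us", return_tensors="pt").input_ids
past_key_values = None

with torch.no_grad():
    for _ in range(32):
        # With a cache, only the most recent token needs to be processed.
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```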

Cross Layer KV-sharing with stateful caching
Optimizing AI Inference at Character.AI
At Character.AI, we're building toward AGI. In that future state, large language models (LLMs) will enhance daily life, providing business productivity and entertainment and helping people with everything from education to coaching, support, brainstorming, creative writing and more. To make that a reality globally, it's critical to achieve highly
https://research.character.ai/optimizing-inference
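A minimal sketch of the cross-layer sharing idea under assumed details (grouping consecutive layers so only the first layer in each group computes and stores K/V while the rest reuse it; the group size and module layout are not from the Character.AI post):

```python
# Minimal sketch of cross-layer KV sharing: within each group of consecutive
# layers, only the first layer projects (and would cache) K/V; the rest reuse it.
# Group size and projection shapes are illustrative assumptions.
import torch
import torch.nn as nn

class SharedKVAttention(nn.Module):
    def __init__(self, d_model, owns_kv):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.owns_kv = owns_kv
        if owns_kv:
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if self.owns_kv:
            shared_kv = (self.k_proj(x), self.v_proj(x))  # shared downstream
        k, v = shared_kv
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v, shared_kv

layers = [SharedKVAttention(64, owns_kv=(i % 3 == 0)) for i in range(6)]
x, kv = torch.randn(1, 10, 64), None
for layer in layers:
    x, kv = layer(x, shared_kv=kv)  # layers 1-2 reuse layer 0's K/V, and so on
```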

Inference and KV cache
All About Transformer Inference | How To Scale Your Model
Performing inference on a Transformer can be very different from training. Partly this is because inference adds a new factor to consider: latency. In this section, we will go all the way from sampling a single new token from a model to efficiently scaling a large Transformer across many slices of accelerators as part of an inference engine.
https://jax-ml.github.io/scaling-book/inference/

Seonglae Cho
