The KV cache is mainly used during Transformer inference. In training, the whole input sequence is processed in a single forward pass (there is no step-by-step generation that repeatedly revisits the same tokens), and computations for all positions are batched together, so caching keys and values brings little benefit. Training memory is also already occupied by activations, gradients, and optimizer states, and a dynamically growing cache would add risk, so it is not used there.
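To make the reuse concrete, here is a minimal sketch of the idea for a single attention head; the dimensions and random projection weights are illustrative stand-ins, not taken from any real model. Each decoding step projects only the newest token, appends its key/value to the cache, and attends over the whole cache instead of recomputing keys and values for all previous tokens.

```python
import torch

# Minimal sketch of the KV-cache idea for a single attention head.
# d_model, d_head, and the random weights are illustrative stand-ins.
d_model, d_head = 128, 64
W_q = torch.randn(d_model, d_head)
W_k = torch.randn(d_model, d_head)
W_v = torch.randn(d_model, d_head)

k_cache = torch.empty(0, d_head)  # keys of all previously generated tokens
v_cache = torch.empty(0, d_head)  # values of all previously generated tokens

def decode_step(x_t):
    """One autoregressive step: project only the newest token, append its
    K/V to the cache, and attend over the whole cache."""
    global k_cache, v_cache
    q_t = x_t @ W_q                                    # (1, d_head)
    k_cache = torch.cat([k_cache, x_t @ W_k], dim=0)   # cache grows one row per step
    v_cache = torch.cat([v_cache, x_t @ W_v], dim=0)
    attn = torch.softmax(q_t @ k_cache.T / d_head**0.5, dim=-1)
    return attn @ v_cache                              # attention output for the new token

for _ in range(5):                  # pretend we generate 5 tokens
    out = decode_step(torch.randn(1, d_model))
print(k_cache.shape)                # torch.Size([5, 64])
```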
Storing the key/value tensors in GPU memory for reuse often consumes a large amount of memory, and in some cases the cache grows larger than the model weights themselves.
In Hugging Face Transformers, the `use_cache` parameter of the `generate` function defaults to `True`, so the KV cache is enabled automatically.
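For reference, a small usage sketch with Hugging Face Transformers; `gpt2` is just an example checkpoint, and passing `use_cache` explicitly only makes the default visible.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # example checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The KV cache stores", return_tensors="pt")
# use_cache=True is already the default; shown here only to make it explicit.
output_ids = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```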
The size of the KV cache grows with several factors (a rough estimate follows the list):

- Number of decoder layers: every layer keeps its own key and value tensors, so more layers mean more KV cache
- Sequence length: longer sequences must store more key/value pairs, increasing memory usage
- Model dimension: the larger the key and value vectors, the more data is stored per token
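As a back-of-the-envelope sketch, the cache holds one key and one value vector per token per layer, so the memory is roughly 2 × layers × sequence length × hidden dimension × bytes per element, scaled by the batch size. The numbers below are only illustrative of a 7B-class decoder (32 layers, hidden size 4096, fp16).

```python
def kv_cache_bytes(num_layers, seq_len, hidden_dim, bytes_per_elem=2, batch_size=1):
    """Rough KV-cache size: keys + values (factor 2) for every layer,
    every token, and every hidden dimension, in fp16 by default."""
    return 2 * num_layers * seq_len * hidden_dim * bytes_per_elem * batch_size

# Illustrative numbers for a 7B-class model (32 layers, hidden size 4096).
size = kv_cache_bytes(num_layers=32, seq_len=4096, hidden_dim=4096)
print(f"{size / 2**30:.1f} GiB")   # ~2.0 GiB for a single 4096-token sequence
```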
Several techniques address this memory growth, such as compressing the KV cache adaptively per attention-head type (FastGen) and allocating more cache to lower layers and less to higher layers (PyramidKV). Combining such methods can yield memory-optimized Transformer inference.
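As a toy illustration of the PyramidKV-style idea only, the sketch below assigns a linearly shrinking per-layer token budget from the lowest to the highest layer; the linear schedule, function name, and numbers are assumptions for illustration, not the paper's actual algorithm.

```python
def pyramid_budgets(num_layers, max_tokens, min_tokens):
    """Toy PyramidKV-style schedule: linearly shrink the per-layer KV budget
    from max_tokens at the lowest layer to min_tokens at the highest layer."""
    step = (max_tokens - min_tokens) / max(num_layers - 1, 1)
    return [round(max_tokens - i * step) for i in range(num_layers)]

budgets = pyramid_budgets(num_layers=32, max_tokens=2048, min_tokens=128)
print(budgets[0], budgets[-1])   # 2048 kept at layer 0, 128 kept at the top layer

# During generation, layer i would then keep only its budgets[i] most important
# key/value pairs (e.g. ranked by attention score) and evict the rest.
```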
Attention KV Caches
KV Cache in production
Cross Layer KV-sharing with stateful caching