KV Cache

Creator: Seonglae Cho
Created: 2024 Mar 2 7:07
Edited: 2024 Oct 26 14:19
It is mainly used in Transformer Inference. During training, the entire input sequence is usually processed in a single forward pass (unlike generation, which repeatedly reuses the same tokens), and computation for all positions is batched, so a KV cache brings little benefit; memory is also needed for many other things such as activations and gradients, so it is not used. Dynamically growing cache memory would also add risk during training.
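As a minimal sketch of why caching pays off at inference time (GPT-2 from Hugging Face Transformers is used here only as an arbitrary example model), greedy incremental decoding reuses the cached keys/values and feeds in only the newly generated token at each step:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("KV caching speeds up", return_tensors="pt").input_ids
generated = input_ids
past_key_values = None  # the KV cache, filled during the first forward pass

with torch.no_grad():
    for _ in range(20):
        out = model(input_ids=input_ids,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values             # reuse cached K/V next step
        next_token = out.logits[:, -1].argmax(-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        input_ids = next_token                            # feed only the new token

print(tokenizer.decode(generated[0]))
```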
Storing the Key/Value tensors in GPU memory for reuse often takes up a lot of memory, and in some cases the cache becomes larger than the model weights themselves.
In Hugging Face Transformers, the corresponding use_cache option of the generate function defaults to True.
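For example (a minimal sketch; the prompt and model are arbitrary), generation with and without the cache can be compared by overriding that flag:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The KV cache", return_tensors="pt")

# use_cache already defaults to True; it is passed explicitly here for contrast.
fast = model.generate(**inputs, max_new_tokens=32, use_cache=True)
slow = model.generate(**inputs, max_new_tokens=32, use_cache=False)  # recomputes K/V every step
```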
The memory the KV cache consumes grows with the following factors (a rough size estimate follows the list):
  1. Number of decoder layers: more layers mean more KV caches
  2. Sequence length: longer sequences must store more key/value pairs
  3. Model dimension: larger key/value dimensions mean more data stored per token
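Putting these factors together, the cache holds two tensors (K and V) per layer, each of shape [batch, heads, sequence length, head dimension]. A rough size estimate (a minimal sketch; the Llama-2-7B-like shape below is only an illustrative assumption):

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, fp16/bf16 by default (2 bytes per element)
    return 2 * num_layers * batch_size * num_kv_heads * seq_len * head_dim * bytes_per_elem

# Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, 4k context
print(kv_cache_bytes(32, 4096, 32, 128) / 2**30, "GiB")  # ~2 GiB per sequence
```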
There are several techniques for reducing this memory, such as compressing the KV cache according to Attention head types (FastGen) and keeping more cache in lower layers and less in higher layers (PyramidKV). Combining such methods enables memory-optimized Transformer inference.
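A toy illustration of the layer-wise budgeting idea (this is not the actual PyramidKV or FastGen algorithm; the linear budget schedule and recency-based pruning are assumptions made only to show the shape of the approach):

```python
def layer_budgets(num_layers, max_tokens, min_tokens):
    # Linearly decreasing token budget: lower layers keep more cache than higher layers.
    step = (max_tokens - min_tokens) / max(num_layers - 1, 1)
    return [round(max_tokens - i * step) for i in range(num_layers)]

def truncate_cache(past_key_values, budgets):
    # past_key_values: per-layer (key, value) pairs of shape [batch, heads, seq, head_dim].
    # Here the most recent tokens are kept; real methods select tokens by attention scores.
    return [(k[:, :, -b:, :], v[:, :, -b:, :])
            for (k, v), b in zip(past_key_values, budgets)]

print(layer_budgets(num_layers=8, max_tokens=512, min_tokens=64))
# [512, 448, 384, 320, 256, 192, 128, 64]
```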
Attention KV Caches

KV Cache in production

Cross Layer KV-sharing with stateful caching