Attention KV Cache

Creator
Alan Jo
Created
2024 Mar 2 7:07
Editor
Alan Jo
Edited
2024 Mar 31 15:44
Refs
It is mainly used in Transformer Inference. In the training stage, all input tokens are processed in a single forward pass (unlike generation, which repeatedly attends over the same previous tokens step by step) and computations for all inputs are batched, so the benefit of a KV cache is small; training already needs memory for many other things, so the cache is not used there. It is also risky when memory usage changes dynamically.
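A minimal single-head sketch of what the cache does at inference time (NumPy, illustrative random projection weights; real implementations keep one cache per layer and per head, in the model's dtype): each decode step projects only the newest token and appends its key/value to the stored tensors instead of recomputing them for the whole prefix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CachedSelfAttention:
    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # Illustrative random projection weights; a trained model would load these.
        self.Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.k_cache = None  # grows to (tokens_seen, d_model)
        self.v_cache = None

    def step(self, x_new):
        """x_new: (1, d_model) embedding of the newest token only."""
        q = x_new @ self.Wq
        k = x_new @ self.Wk
        v = x_new @ self.Wv
        # Append the new K/V instead of recomputing them for all previous tokens.
        self.k_cache = k if self.k_cache is None else np.vstack([self.k_cache, k])
        self.v_cache = v if self.v_cache is None else np.vstack([self.v_cache, v])
        scores = (q @ self.k_cache.T) / np.sqrt(q.shape[-1])
        return softmax(scores) @ self.v_cache  # (1, d_model)

attn = CachedSelfAttention(d_model=8)
rng = np.random.default_rng(1)
for _ in range(4):  # each decode step only projects the newest token
    out = attn.step(rng.standard_normal((1, 8)))
print(out.shape)  # (1, 8)
```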
Because the Key/Value tensors are stored in GPU memory for reuse, the cache often takes up a large amount of memory; in some cases it is larger than the model itself.
In Hugging Face Transformers, the `use_cache` parameter of the `generate` function defaults to `True`.
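A quick usage sketch ("gpt2" is just an example checkpoint; the flag can also be set in the model's generation config):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The KV cache stores", return_tensors="pt")
# use_cache=True is already the default; passing it just makes the choice explicit.
output_ids = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```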
The size of the KV cache is determined by (a rough size estimate follows the list):
  1. Number of decoder layers: more layers mean more KV caches
  2. Sequence length: longer sequences increase memory usage because more key/value pairs must be stored
  3. Model dimension: the larger the key/value dimensions, the more data must be stored
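Putting these factors together, a back-of-the-envelope estimate (the function and the 7B-style configuration below are illustrative assumptions; exact cache layouts differ between implementations):

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, each of shape
    # (batch_size, num_heads, seq_len, head_dim).
    return 2 * num_layers * batch_size * num_heads * seq_len * head_dim * bytes_per_elem

# Illustrative 7B-class config: 32 layers, 32 heads, head_dim 128, fp16 (2 bytes).
size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.2f} GiB")  # ~2 GiB for one 4096-token sequence
```

With a batch of several such sequences, the cache alone approaches or exceeds the roughly 13 GiB of fp16 weights of a 7B model, which is the situation described above where the cache outgrows the model.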
Attention KV Caches

KV Cache in production


Recommendations