More cache in lower layers, less in higher layers
LLMs aggregate information through pyramidal information funneling: attention scatters widely in the lower layers, progressively consolidates within specific contexts in the middle layers, and ultimately focuses on a few critical tokens in the higher layers.
Motivated by these insights, PyramidKV dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones.
Alpha (α) is a hyperparameter that sets the number of most recent tokens retained at every layer, since they carry recent, crucial information. The remaining per-layer cache sizes follow an arithmetic sequence, decreasing from the lowest layer to the highest to form a pyramid and allocate memory efficiently per layer.
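As a rough sketch of this allocation scheme (not PyramidKV's actual implementation), the following assumes each layer reserves `alpha` recent tokens, and the leftover budget is split as an arithmetic sequence that decays linearly to zero at the top layer; the function name and parameters are illustrative:

```python
def pyramid_cache_sizes(num_layers, total_budget, alpha):
    """Split a total KV-cache budget across layers as a pyramid:
    largest at the lowest layer, smallest at the highest.
    Every layer always keeps `alpha` recent tokens (assumption:
    the arithmetic sequence decays to zero extra slots at the top)."""
    # Budget remaining after reserving alpha recent tokens per layer.
    extra_total = total_budget - alpha * num_layers
    # For an arithmetic sequence from e_max down to 0 over L layers,
    # the sum is L * e_max / 2, so solve for e_max.
    e_max = 2 * extra_total / num_layers
    sizes = []
    for i in range(num_layers):
        # Linearly interpolate: layer 0 gets e_max extra, top layer gets 0.
        extra = e_max * (num_layers - 1 - i) / (num_layers - 1)
        sizes.append(alpha + round(extra))
    return sizes

# Example: 4 layers, 400-token budget, keep the last 8 tokens everywhere.
sizes = pyramid_cache_sizes(num_layers=4, total_budget=400, alpha=8)
# Lower layers get more cache; the top layer keeps only the alpha recent tokens.
```

Note that the total stays within the budget up to rounding, and the top layer's cache collapses to just the α recent tokens under this particular decay-to-zero assumption.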