More cache in lower layers, less in higher layers
LLMs aggregate information through pyramidal information funneling: attention scatters widely in the lower layers, progressively consolidates within specific contexts in the middle layers, and ultimately focuses on a few critical tokens in the higher layers.
Motivated by these insights, PyramidKV dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones.
Alpha (α) is a hyperparameter that sets the number of most recent tokens retained at every layer, since they carry recent, crucial information. The remaining per-layer cache sizes follow an arithmetic sequence, decreasing from the lowest layer to the highest to form a pyramid and allocate memory efficiently per layer.
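As a rough sketch of this allocation scheme (not PyramidKV's actual implementation), the following assumes each layer reserves `alpha` recent tokens, and the leftover budget is split as an arithmetic sequence that decays linearly to zero at the top layer; the function name and parameters are illustrative:

```python
def pyramid_cache_sizes(num_layers, total_budget, alpha):
    """Split a total KV-cache budget across layers as a pyramid:
    largest at the lowest layer, smallest at the highest.
    Every layer always keeps `alpha` recent tokens (assumption:
    the arithmetic sequence decays to zero extra slots at the top)."""
    # Budget remaining after reserving alpha recent tokens per layer.
    extra_total = total_budget - alpha * num_layers
    # For an arithmetic sequence from e_max down to 0 over L layers,
    # the sum is L * e_max / 2, so solve for e_max.
    e_max = 2 * extra_total / num_layers
    sizes = []
    for i in range(num_layers):
        # Linearly interpolate: layer 0 gets e_max extra, top layer gets 0.
        extra = e_max * (num_layers - 1 - i) / (num_layers - 1)
        sizes.append(alpha + round(extra))
    return sizes

# Example: 4 layers, 400-token budget, keep the last 8 tokens everywhere.
sizes = pyramid_cache_sizes(num_layers=4, total_budget=400, alpha=8)
# Lower layers get more cache; the top layer keeps only the alpha recent tokens.
```

Note that the total stays within the budget up to rounding, and the top layer's cache collapses to just the α recent tokens under this particular decay-to-zero assumption.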