KV cache compression based on attention head types
The KV cache is compressed per attention head, according to the head's type: long-range context is evicted on heads that emphasize local context, non-special tokens are discarded on heads centered on special tokens, and the standard (full) KV cache is kept only for heads that attend broadly to all tokens.
Head types are identified by lightweight attention profiling.
- Local context head - evict long-range context
- Special token head - discard non-special tokens
- Broad attention head - keep the standard KV cache
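
The per-head policy above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the actual profiling procedure: the notes do not say how heads are profiled, so the function names (`attention_mass`, `classify_head`, `kept_positions`), the `window` size, the `threshold` value, and the rule "check special-token mass first, then local mass" are all hypothetical choices. The input `attn` is assumed to be one head's causal attention matrix (a list of rows, one per query, each summing to 1) collected on a small profiling sample.

```python
def attention_mass(attn, key_selector):
    """Average over queries of the attention weight on keys where key_selector(q, k) is true."""
    total = 0.0
    for q, row in enumerate(attn):
        total += sum(w for k, w in enumerate(row) if key_selector(q, k))
    return total / len(attn)


def classify_head(attn, special_idx, window=2, threshold=0.8):
    """Label one head as 'special', 'local', or 'global' from its attention profile.

    Assumption: a head is 'special' if most of its attention mass lands on
    special-token positions, 'local' if most of it lands on the last `window`
    positions before each query, and 'global' otherwise.
    """
    special = set(special_idx)
    special_mass = attention_mass(attn, lambda q, k: k in special)
    local_mass = attention_mass(attn, lambda q, k: 0 <= q - k < window)
    if special_mass >= threshold:
        return "special"
    if local_mass >= threshold:
        return "local"
    return "global"


def kept_positions(head_type, seq_len, special_idx, window=2):
    """Return the KV cache positions retained for a head of the given type."""
    if head_type == "local":
        # Evict long-range context: keep only the most recent `window` tokens.
        return list(range(max(0, seq_len - window), seq_len))
    if head_type == "special":
        # Discard non-special tokens: keep only special-token positions.
        return sorted(i for i in set(special_idx) if i < seq_len)
    # Broad attention head: keep the standard (full) KV cache.
    return list(range(seq_len))
```

For a 6-token sequence whose only special token is at position 0, a head that puts all its weight on the current token would be labeled `local` and keep positions 4 and 5, while a head that puts all its weight on position 0 would be labeled `special` and keep only position 0. In practice the threshold and window would need tuning per model.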