FastGen

Creator
Seonglae Cho
Created
2024 Oct 26 13:55
Edited
2024 Oct 26 14:14

KV compression based on attention head types

FastGen applies a different KV cache policy to each attention head, depending on what the head attends to: it evicts long-range context on heads that emphasize local context, discards non-special tokens on heads centered on special tokens, and keeps the standard (uncompressed) KV cache only for heads that attend broadly to all tokens. Head types are identified by lightweight attention profiling.
  • Local Context head - evict long-range context
  • Special Token head - discard non-special tokens
  • Broad Attention head - keep the standard KV cache
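The per-head policy choice described above can be sketched in a few lines of Python. This is a minimal illustration, not FastGen's actual implementation: the function names (`kept_positions`, `retained_mass`, `choose_policy`), the window size, the 0.95 threshold, and the cost ordering of the policies are all assumptions made for the example. The idea is to profile one head's attention map, measure how much attention mass each policy would retain, and pick the first policy (in an assumed cheapest-first order) that retains enough.

```python
from typing import List, Set


def kept_positions(policy: str, query_idx: int, window: int, special: Set[int]) -> Set[int]:
    """Key positions (causal, so <= query_idx) that a policy keeps in the KV cache."""
    visible = range(query_idx + 1)
    if policy == "local":
        # Keep only the most recent `window` tokens; evict long-range context.
        return {k for k in visible if query_idx - k < window}
    if policy == "special":
        # Keep only special tokens; discard everything else.
        return {k for k in visible if k in special}
    if policy == "full":
        # Standard KV cache: keep every visible token.
        return set(visible)
    raise ValueError(f"unknown policy: {policy}")


def retained_mass(attn: List[List[float]], policy: str, window: int, special: Set[int]) -> float:
    """Average fraction of attention mass that survives compression, over all queries."""
    total = 0.0
    for q, row in enumerate(attn):
        keep = kept_positions(policy, q, window, special)
        total += sum(row[k] for k in keep)
    return total / len(attn)


def choose_policy(attn: List[List[float]], window: int, special: Set[int],
                  threshold: float = 0.95) -> str:
    """Pick the first policy that retains at least `threshold` of the attention mass.

    The candidate order is an assumed cheapest-first ranking (hypothetical);
    "full" keeps everything, so it is the fallback.
    """
    for policy in ("special", "local", "full"):
        if retained_mass(attn, policy, window, special) >= threshold:
            return policy
    return "full"


# Toy profiled attention map for one head (rows: queries, columns: keys, causal).
# Each query attends only to itself and the previous token.
attn_local = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
print(choose_policy(attn_local, window=2, special={0}))  # prints "local"
```

In this toy setup, a head whose attention falls entirely on position 0 (treated here as a special token) would be assigned `"special"`, and a head that spreads attention evenly over all past tokens would fall through to `"full"`.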
