FastGen

Creator

Seonglae Cho

Created

2024 Oct 26 13:55

Editor

Seonglae Cho

Edited

2024 Oct 26 14:14

Refs

KV compression based on attention head types

FastGen compresses the KV cache per attention head: it evicts long-range context on heads that emphasize local context, discards non-special tokens on heads centered on special tokens, and keeps the standard (full) KV cache only for heads that attend broadly to all tokens.

Head types are identified with lightweight attention profiling

Local Context head - evicting long-range contexts

Special Token head - discarding non-special tokens
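A minimal sketch of the idea above, not the paper's implementation. It assumes profiling means: for each head, pick the cheapest cache policy whose kept positions still recover most of that head's attention mass. The names `kept_positions` and `profile_head`, the `window` size, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set


def kept_positions(policy: str, seq_len: int, special: Set[int], window: int) -> List[int]:
    """Return the KV-cache positions a head keeps under a given policy."""
    if policy == "special":
        # Special Token head: discard every non-special token.
        return sorted(p for p in special if p < seq_len)
    if policy == "local":
        # Local Context head: evict long-range context, keep the last `window` tokens.
        return list(range(max(0, seq_len - window), seq_len))
    if policy == "full":
        # Broad head: standard KV cache, nothing evicted.
        return list(range(seq_len))
    raise ValueError(f"unknown policy: {policy}")


def profile_head(
    attn_row: Sequence[float],
    special: Set[int],
    window: int,
    threshold: float = 0.95,  # assumed recovery ratio, not taken from the source
) -> str:
    """Pick the cheapest policy that retains at least `threshold` of the attention mass."""
    seq_len = len(attn_row)
    total = sum(attn_row)
    candidates = ["special", "local", "full"]
    # Try the policies that keep the fewest positions first.
    candidates.sort(key=lambda p: len(kept_positions(p, seq_len, special, window)))
    for policy in candidates:
        kept = kept_positions(policy, seq_len, special, window)
        if sum(attn_row[i] for i in kept) >= threshold * total:
            return policy
    return "full"  # fallback: keep everything


# One attention row per head (weights of the latest query over 6 cached tokens).
local_head = [0.0, 0.0, 0.01, 0.01, 0.48, 0.50]
special_head = [0.97, 0.01, 0.01, 0.0, 0.0, 0.01]
broad_head = [1 / 6] * 6
special_tokens = {0}  # e.g. a beginning-of-sequence token at position 0

for name, row in [("local", local_head), ("special", special_head), ("broad", broad_head)]:
    policy = profile_head(row, special_tokens, window=2)
    print(name, policy, kept_positions(policy, len(row), special_tokens, window=2))
```

Each head gets its own policy, so memory is saved only on heads where eviction loses little attention mass, while broadly attending heads keep the full cache.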

openreview.net

https://openreview.net/pdf?id=uNrFpDPMyo
