YOCO

Creator
Alan Jo
Created
2024 May 18 6:53
Editor
Alan Jo
Edited
2024 May 18 7:16
Refs

Cross-Attention
KV Caching (You Only Cache Once)

The design substantially reduces GPU memory demands while retaining global attention capability.
YOCO introduces a decoder-decoder architecture that caches key-value pairs only once: the self-decoder produces global KV caches using efficient self-attention, and the cross-decoder reuses those caches, reducing memory demands.
YOCO cuts GPU memory usage by up to 10x and reduces prefill latency for 512K-token contexts from 180s to under 6s. It scales efficiently with model size and handles contexts up to 1M tokens with near-perfect needle-retrieval accuracy.
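As a rough back-of-envelope sketch of where the savings come from (the hyperparameters below are illustrative assumptions, not the paper's configuration): a standard decoder stores a KV cache for every layer, whereas YOCO keeps one global cache shared by all cross-decoder layers plus a small constant-size window cache in the self-decoder, so the reduction grows roughly with the layer count.

```python
# Back-of-envelope KV-cache sizes in fp16 (2 bytes/element).
# All hyperparameters here are assumed for illustration only.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V

seq_len, n_layers, n_kv_heads, head_dim, window = 512_000, 32, 8, 128, 4_096

standard = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim)
yoco_global = kv_cache_bytes(seq_len, 1, n_kv_heads, head_dim)             # cached once
yoco_window = kv_cache_bytes(window, n_layers // 2, n_kv_heads, head_dim)  # bounded by window

print(f"standard: {standard / 2**30:.1f} GiB")                     # 62.5 GiB
print(f"YOCO:     {(yoco_global + yoco_window) / 2**30:.1f} GiB")  # ~2.2 GiB
```

The exact factor depends on the layer count and head configuration, which is why the reported savings ("up to 10x") are model-dependent.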
Specifically, YOCO stacks a cross-decoder on top of a self-decoder. Given an input sequence, the self-decoder applies efficient self-attention with linear complexity (e.g., Sliding window attention, Gated Retention) to obtain the KV caches; the cross-decoder layers then employ cross-attention to reuse those shared KV caches.
The word "once" refers to the global KV cache. Strictly speaking, the self-decoder also needs to store some caches of its own, but because it uses an efficient attention module, that cache size is bounded by a constant and is negligible compared to the global caches when the sequence length is large.
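Below is a minimal PyTorch sketch of this decoder-decoder layout; the module choices, shapes, and hyperparameters are my own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class YOCOSketch(nn.Module):
    """Decoder-decoder sketch: efficient self-decoder -> one shared KV -> cross-decoder."""

    def __init__(self, d_model=512, n_heads=8, n_layers=8, window=128):
        super().__init__()
        half = n_layers // 2
        # Self-decoder: sliding-window attention keeps its cache bounded by `window`.
        self.self_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(half))
        self.kv_proj = nn.Linear(d_model, 2 * d_model)  # global K and V, produced once
        # Cross-decoder: every layer attends to the same shared KV cache.
        self.cross_layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(half))
        self.window = window

    def forward(self, x):                               # x: (batch, seq, d_model)
        seq = x.size(1)
        pos = torch.arange(seq, device=x.device)
        dist = pos[None, :] - pos[:, None]              # dist[i, j] = j - i
        causal = dist > 0                               # True = blocked (future tokens)
        windowed = causal | (dist < -self.window)       # also block beyond the window

        h = x
        for layer in self.self_layers:                  # self-decoder
            h = layer(h, src_mask=windowed)
        k, v = self.kv_proj(h).chunk(2, dim=-1)         # "you only cache once"

        for attn in self.cross_layers:                  # cross-decoder reuses shared KV
            out, _ = attn(h, k, v, attn_mask=causal)
            h = h + out                                 # residual; norms/FFNs omitted
        return h

if __name__ == "__main__":
    y = YOCOSketch()(torch.randn(2, 64, 512))
    print(y.shape)                                      # torch.Size([2, 64, 512])
```

During inference only `k` and `v` (plus the self-decoder's window-bounded cache) would need to be kept, which is the property the paragraph above describes.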
To summarize my understanding: the idea is to extract the KV cache just once with linear attention, cache only a modest amount, and thereby minimize inference/training overhead and KV cache size. The idea itself is a fresh, sensible optimization in that it reuses KV by combining it with cross-attention, so it should be useful in practice; however, since it depends on a model architecture that adds an efficient attention block, it is questionable whether it will be widely adopted in industry.