Grouped-query Attention

Creator
Seonglae Cho
Created
2023 Oct 14 14:24
Edited
2025 Feb 19 22:3

GQA

If a model uses GQA with 4x fewer KV heads, the KV-cache size and the memory bandwidth required to read it are 4x smaller
Can be applied retroactively by uptraining an existing checkpoint, and is a generalization that includes MHA (one KV head per query head) and MQA (a single shared KV head) as special cases
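Below is a minimal PyTorch sketch of the mechanism (the function name and tensor shapes are illustrative assumptions, not a reference implementation): each group of query heads shares one K/V head, so n_kv_heads = n_heads recovers MHA and n_kv_heads = 1 recovers MQA.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    # q: (batch, seq, n_heads, head_dim)
    # k, v: (batch, seq, n_kv_heads, head_dim)
    # Each group of n_heads // n_kv_heads query heads shares one K/V head,
    # so only n_kv_heads K/V heads ever need to be cached.
    group_size = n_heads // n_kv_heads
    # Expand K/V so every query head has a matching K/V head for the matmul.
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)
    # Standard scaled dot-product attention per head.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (batch, heads, seq, dim)
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2)  # (batch, seq, n_heads, head_dim)
```

Only the n_kv_heads K/V projections need to be stored in the KV cache, which is where the memory and bandwidth saving comes from.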
 
 

KV Cache & GQA

It would be interesting to see the technical report, as it may contain relevant ablation studies. Purely from a cost/performance point of view, though, GQA should be evaluated for every transformer-based LLM, as the benefits are too significant to ignore.
When you ask ChatGPT to help with a task, your request is processed concurrently with many other requests on the same GPU. A smaller KV cache per request lets more requests fit in GPU memory at once, and the memory bandwidth is utilized more efficiently.
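A rough back-of-the-envelope sketch of that effect (the layer count, head counts, sequence length, and batch size below are hypothetical, not taken from any particular model): the KV-cache footprint scales linearly with the number of KV heads, so a 4x reduction in KV heads cuts both the cache size and the bandwidth needed to re-read it at every decoding step by 4x.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x because both K and V are cached for every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical 32-layer model with 32 query heads of dim 128, fp16 cache,
# serving a batch of 8 requests with 4096 tokens of context each.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=8)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8,  head_dim=128, seq_len=4096, batch=8)
print(f"MHA cache: {mha / 2**30:.1f} GiB, 4x-GQA cache: {gqa / 2**30:.1f} GiB")
# MHA cache: 16.0 GiB, 4x-GQA cache: 4.0 GiB
```

With the cache 4x smaller, the same GPU memory holds roughly 4x more concurrent requests, which is where the serving-cost benefit comes from.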
 
 

Recommendations