Grouped-query Attention

Creator: Alan Jo
Created: 2023 Oct 14 14:24
Editor: Alan Jo
Edited: 2024 Mar 18 13:36

GQA

If these models had used 4x GQA (four query heads sharing each KV head), the KV-cache would have been 4x smaller, and so would the memory bandwidth required to read it.
GQA can be seen as a generalization that includes both MHA and MQA as special cases.
GQA can also be applied after the fact, by converting an existing MHA checkpoint and uptraining it.
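A minimal sketch of the idea in PyTorch (illustrative only; the shapes and the repeat_interleave-based broadcast are my assumptions, not any specific model's implementation). Setting n_kv_heads equal to the number of query heads recovers MHA, and n_kv_heads=1 recovers MQA:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads):
    """Sketch of a GQA forward pass (masking omitted for brevity).

    q:    (batch, n_heads,    seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim)
    n_kv_heads must divide n_heads; each KV head serves a group of
    n_heads // n_kv_heads query heads.
    n_kv_heads == n_heads -> MHA; n_kv_heads == 1 -> MQA.
    """
    n_heads = q.shape[1]
    group = n_heads // n_kv_heads
    # Broadcast each KV head across its group of query heads.
    k = k.repeat_interleave(group, dim=1)  # (batch, n_heads, seq, head_dim)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```

The key point is that only the smaller k and v tensors need to be cached during generation; the broadcast happens at compute time.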
 
 

KV Cache & GQA

It would be interesting to see the technical report, as it may contain relevant ablation studies. But purely from a cost/performance point of view, GQA should be evaluated for every transformer-based LLM, as the benefits are too significant to ignore.
When you ask ChatGPT to help with a task, your request is evaluated concurrently with many other requests on the same GPU. A smaller per-request KV-cache means more requests fit in memory at once, so the available bandwidth is used more efficiently.
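To make the 4x figure concrete, here is a back-of-the-envelope KV-cache size calculation. The layer count and dimensions below are hypothetical, chosen only to show the effect of the group factor:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_el=2):
    # Two cached tensors (K and V) per layer; fp16 by default.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_el

# Hypothetical 32-head model, fp16, one 4096-token request:
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=1)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8,  head_dim=128, seq_len=4096, batch=1)
print(mha / 2**30, gqa / 2**30)  # 2.0 GiB vs 0.5 GiB -> 4x smaller per request
```

At a fixed memory budget, that 4x reduction translates directly into a larger batch of concurrent requests per GPU.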
 
 

Recommendations