GQA
If these models used 4x GQA (i.e., four query heads sharing each key/value head), the size and required bandwidth of the KV cache would have been 4x smaller.
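A rough back-of-the-envelope illustrating the 4x claim, using made-up config numbers (a hypothetical 32-layer model with 32 heads of dimension 128, fp16, 4096-token context), not any specific model:

```python
# Illustrative KV-cache sizing; all numbers are assumptions, not a real model config.
n_layers, n_heads, head_dim = 32, 32, 128
bytes_per_elem = 2   # fp16
seq_len = 4096

def kv_cache_bytes(n_kv_heads: int) -> int:
    # 2x for keys and values; one K and one V vector per KV head, per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

mha = kv_cache_bytes(32)  # MHA: every query head has its own KV head
gqa = kv_cache_bytes(8)   # 4x GQA: 4 query heads share each KV head
print(f"MHA: {mha / 2**30:.2f} GiB, 4x GQA: {gqa / 2**30:.2f} GiB")  # 2.00 vs 0.50
```

Since the cache must be re-read from HBM on every decode step, the same 4x applies to the bandwidth it consumes.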
GQA can also be viewed as a generalization that includes both MHA and MQA.
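A minimal sketch of that view (names and shapes here are illustrative): the number of KV heads is just a knob, with `n_kv_heads == n_heads` recovering MHA and `n_kv_heads == 1` recovering MQA.

```python
import torch

def grouped_attention(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_heads // n_kv_heads            # query heads per KV head
    k = k.repeat_interleave(group, dim=1)    # share each KV head across its group
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# n_kv_heads == n_heads -> MHA; n_kv_heads == 1 -> MQA; in between -> GQA
```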
GQA can also be applied post hoc to an existing MHA checkpoint (the GQA paper does this by mean-pooling the key and value heads within each group, followed by a short uptraining run).
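A sketch of that conversion step, assuming per-head projection weights stacked into one tensor (the function name, shapes, and sizes below are assumptions for illustration):

```python
import torch

def mean_pool_kv(w: torch.Tensor, n_heads: int, n_kv_heads: int) -> torch.Tensor:
    # w: stacked per-head K (or V) projection weights, shape (n_heads, head_dim, d_model)
    group = n_heads // n_kv_heads
    w = w.view(n_kv_heads, group, *w.shape[1:])
    return w.mean(dim=1)  # average the heads within each group

# Example: collapse 32 MHA key heads into 8 GQA key heads (4x groups)
w_k = torch.randn(32, 128, 4096)
w_k_gqa = mean_pool_kv(w_k, n_heads=32, n_kv_heads=8)
print(w_k_gqa.shape)  # torch.Size([8, 128, 4096])
```

The pooled checkpoint is then uptrained briefly so the model adapts to the shared KV heads.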
KV Cache & GQA
It would be interesting to see the technical report, as it may contain relevant ablation studies; but purely from a cost/performance point of view, GQA needs to be evaluated for every transformer-based LLM, because the benefits are too significant to ignore.
When you ask ChatGPT to help with a task, your request is evaluated concurrently with many other requests on the same GPU, so the memory bandwidth is utilized more efficiently.
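Roughly why batching helps (illustrative arithmetic, not measured numbers): each decode step streams the model weights from HBM once regardless of batch size, so serving B requests together amortizes that traffic B ways, while each request still pays for reading its own KV cache.

```python
# Per-request HBM traffic per decode step; both figures below are assumptions.
weight_bytes = 14e9        # e.g. a 7B-parameter model in fp16
kv_bytes_per_req = 0.5e9   # assumed KV-cache read per request per step

for batch in (1, 8, 64):
    per_req = weight_bytes / batch + kv_bytes_per_req
    print(f"batch={batch:3d}: ~{per_req / 1e9:.2f} GB read per request per step")
```

Note that the KV-cache term does not shrink with batching, which is exactly why reducing it with GQA matters for serving throughput.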