Grouped-query Attention

Creator: Seonglae Cho
Created: 2023 Oct 14 14:24
Edited: 2025 May 9 19:42

GQA

If these models had used 4x GQA, the KV cache size and the bandwidth required to read it would have been 4x smaller. GQA can also be applied retroactively to existing checkpoints, since it is a generalization that includes both MHA and MQA as special cases.
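As a rough illustration of that 4x figure, here is a back-of-the-envelope sketch; the layer count, head count, and dimensions below are hypothetical placeholders, not taken from any particular model:

```python
# Rough per-sequence KV-cache size estimate (hypothetical model dimensions).
n_layers = 32        # transformer layers
n_heads = 32         # query heads
head_dim = 128       # dimension per head
seq_len = 4096       # cached tokens
bytes_per_elem = 2   # fp16/bf16

def kv_cache_bytes(n_kv_heads: int) -> int:
    # 2x for keys and values, stored for every layer and every cached token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

mha_bytes = kv_cache_bytes(n_kv_heads=n_heads)       # MHA: one K/V pair per query head
gqa_bytes = kv_cache_bytes(n_kv_heads=n_heads // 4)  # 4x GQA: 4 query heads share each K/V head

print(f"MHA KV cache:    {mha_bytes / 2**30:.2f} GiB")  # 2.00 GiB
print(f"4x GQA KV cache: {gqa_bytes / 2**30:.2f} GiB")  # 0.50 GiB, i.e. 4x smaller
```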
Instead of keeping a unique key and value matrix for each attention head, MQA (Multi-query Attention) shares a single key and value matrix across all heads; this modification does hurt model quality, however. GQA, rather than forcing every attention head in a layer to share the same key and value matrices, splits the heads into groups, and only the heads within a group share key and value matrices. MLA goes further, reducing KV cache size while at the same time improving performance. The idea is to let the model learn to compress its own keys and values: MLA adds an extra step between each attention head's input and the key and value matrices, projecting the input down into a compressed latent space shared across heads and then projecting that latent back up to keys and values with a separate set of learned weights per head. This works because the attention heads tend to produce similar keys and values, so storing only the shared latent is an efficient way to shrink the KV cache. The shared latent space also yields better performance than the non-shared baseline, possibly due to a noise-reduction effect of the shared representation.
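A minimal sketch of grouped-query attention in PyTorch, with hypothetical dimensions; setting n_kv_heads equal to n_heads recovers MHA, and n_kv_heads = 1 recovers MQA:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    """q: (batch, seq, n_heads * head_dim); k, v: (batch, seq, n_kv_heads * head_dim)."""
    b, s, _ = q.shape
    head_dim = q.shape[-1] // n_heads
    group_size = n_heads // n_kv_heads  # query heads per shared K/V head

    q = q.view(b, s, n_heads, head_dim).transpose(1, 2)     # (b, n_heads, s, d)
    k = k.view(b, s, n_kv_heads, head_dim).transpose(1, 2)  # (b, n_kv_heads, s, d)
    v = v.view(b, s, n_kv_heads, head_dim).transpose(1, 2)

    # Each K/V head is shared by `group_size` query heads.
    k = k.repeat_interleave(group_size, dim=1)              # (b, n_heads, s, d)
    v = v.repeat_interleave(group_size, dim=1)

    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, s, -1)

# n_kv_heads = n_heads -> MHA, n_kv_heads = 1 -> MQA, in between -> GQA
b, s, n_heads, n_kv_heads, d = 2, 16, 8, 2, 64
q = torch.randn(b, s, n_heads * d)
k = torch.randn(b, s, n_kv_heads * d)
v = torch.randn(b, s, n_kv_heads * d)
out = grouped_query_attention(q, k, v, n_heads, n_kv_heads)  # (2, 16, 512)
```

Only the smaller K and V tensors need to be cached; they are expanded to the full head count on the fly at attention time.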
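And a rough sketch of the MLA key/value path described above; the down/up projection shapes are illustrative assumptions, not DeepSeek's exact implementation (which also handles rotary embeddings separately):

```python
import torch
import torch.nn as nn

class MLAKeyValue(nn.Module):
    """Compress the input into a shared latent, then project up to per-head K and V."""
    def __init__(self, d_model=512, n_heads=8, head_dim=64, latent_dim=128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.down = nn.Linear(d_model, latent_dim, bias=False)             # shared compression
        self.up_k = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # per-head keys
        self.up_v = nn.Linear(latent_dim, n_heads * head_dim, bias=False)  # per-head values

    def forward(self, x):                       # x: (batch, seq, d_model)
        latent = self.down(x)                   # only this latent needs to be KV-cached
        b, s, _ = x.shape
        k = self.up_k(latent).view(b, s, self.n_heads, self.head_dim)
        v = self.up_v(latent).view(b, s, self.n_heads, self.head_dim)
        return latent, k, v

mla = MLAKeyValue()
x = torch.randn(2, 16, 512)
latent, k, v = mla(x)
# Per token and layer, the cache holds latent_dim = 128 values
# instead of 2 * 8 * 64 = 1024 for the full keys and values.
```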
 
 

KV Cache & GQA

It would be interesting to see the technical report as it may contain relevant ablation studies, but purely from the cost/performance point of view, GQA needs to be evaluated for every transformer-based LLM as the benefits are too significant to ignore.
When you ask ChatGPT to help with a task, your request is processed concurrently with many other requests on the same GPU, so memory bandwidth is used more efficiently.
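To make the bandwidth point concrete, here is a rough estimate of KV-cache traffic per decoding step for a batch of concurrent requests; all numbers are hypothetical, and it assumes every cached K/V element is read once per step:

```python
# KV-cache bytes read per decoding step across a batch of concurrent requests
# (hypothetical serving setup; dense attention over the full cache).
batch_size = 64
n_layers, head_dim, seq_len, bytes_per_elem = 32, 128, 4096, 2

def traffic_per_step(n_kv_heads: int) -> int:
    return batch_size * 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

print(f"MHA (32 KV heads):   {traffic_per_step(32) / 2**30:.0f} GiB per step")  # 128 GiB
print(f"4x GQA (8 KV heads): {traffic_per_step(8) / 2**30:.0f} GiB per step")   # 32 GiB
```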
 
 

Recommendations