Multi-head Attention

Creator: Seonglae Cho
Created: 2023 Mar 5 23:52
Edited: 2025 Oct 22 22:24

MHA

An important implementation detail that improves training stability and expressiveness.

How

In theory, the total computational cost is the same as single-head attention, but the independence of the heads greatly increases expressiveness and stability. The computation is mathematically equivalent yet easier to parallelize, so it runs faster. The input vector is split into multiple heads, each performs attention independently, and the results are concatenated. Because the vector is segmented rather than processed as a single vector, an output projection (a linear transformation) is applied afterwards to mix the heads.
In other words, each head learns relationships in a different subspace: the input is split into as many independent projections as there are heads. While a single attention map can learn only one relationship pattern, multi-head attention can learn diverse aspects in parallel, such as syntax, semantics, position, and coreference.
In summary, MHA provides a superior Inductive Bias in terms of stability and generalization.
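The split-attend-concat-project flow described above can be sketched as follows (a minimal NumPy sketch with hypothetical weight matrices `Wq`, `Wk`, `Wv`, `Wo`; a real implementation would learn these and add masking, batching, and dropout):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) projections."""
    seq, d_model = x.shape
    d_head = d_model // n_heads

    # Project, then split the model dimension into independent heads
    def split(t):  # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)

    # Scaled dot-product attention within each head, independently
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    out = softmax(scores) @ v                            # (n_heads, seq, d_head)

    # Concatenate heads back together and apply the output projection
    out = out.transpose(1, 0, 2).reshape(seq, d_model)
    return out @ Wo
```

Note that the per-head dimension is `d_model / n_heads`, which is why the total cost matches single-head attention while each head operates in its own subspace.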

Result

Each head learns to attend to different semantics simultaneously. Because attention uses the softmax function, the highest score is amplified while the lower ones are squashed, so each head tends to focus on a single element. Multiple heads therefore let the model attend to several words at once. They also provide redundancy: if any single head fails, the other heads can compensate.
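The winner-take-most behavior of softmax can be seen with a few illustrative scores (the numbers below are arbitrary, chosen only to show the effect):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Arbitrary attention scores: one clear winner, several runners-up
scores = np.array([2.0, 1.0, 0.5, 0.1])
weights = softmax(scores)

# The largest score captures most of the probability mass,
# which is why a single head concentrates on roughly one element.
print(weights)
```

Because one element dominates each head's distribution, covering several relevant tokens at once requires several heads.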
Multi-head Attentions
On GQA (Grouped Query Attention), the inference speed-up technique applied in Llama 2 (DEVOCEAN tech blog)

Sparse

Some Intuition on Attention and the Transformer
What's the big deal, intuition on query-key-value vectors, multiple heads, multiple layers, and more.
