Multi-head Attention

Creator
Seonglae Cho
Created
2023 Mar 5 23:52
Edited
2025 Oct 22 22:24

MHA

An important implementation detail that improves training.

How

In theory, the total computational cost is the same as a single large attention head, but the independence of the heads greatly increases expressiveness and stability. The split-head formulation is mathematically equivalent yet easier to parallelize, so computation is faster in practice. The input vector is split into multiple heads, each performing attention independently, and the results are then combined. Because the vector is segmented across heads rather than kept as a single vector, an output projection (a linear transformation) is applied to mix the concatenated head outputs.
In other words, the model splits into independent projections equal to the number of heads, so each head learns relationships in a different subspace. While a single attention map can only learn one relationship pattern, multi-head attention can learn diverse aspects simultaneously, such as syntax, semantics, position, and co-reference.
In summary, MHA provides a superior Inductive Bias in terms of stability and generalization.
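
A minimal NumPy sketch of the split-project-attend-concatenate flow described above (the weight names, `n_heads`, and the toy dimensions are illustrative assumptions, not taken from the text):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Minimal multi-head self-attention over a (seq_len, d_model) input."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project once, then split the model dimension into independent heads
    q = (x @ W_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ W_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ W_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Each head attends independently in its own subspace
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads, then mix them with the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage: 4 tokens, d_model=8, 2 heads
rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
x = rng.normal(size=(4, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (4, 8): same shape as the input, as in single-head attention
```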

Result

Each head learns to look at different semantics simultaneously. Because attention uses the softmax function, the highest value is amplified while the lower ones are squashed, so each head tends to focus on a single element. Multiple heads therefore let the model attend to several words at once. They also provide redundancy: if any single head fails, the other attention heads can compensate.
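
A quick sketch of this softmax behaviour (the score values are made up for illustration): the largest score already takes most of the probability mass, and sharper scores squash the rest even further, which is why each head tends to lock onto one element.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.1])
print(softmax(scores))      # ~[0.57, 0.21, 0.13, 0.09]: the top score dominates
print(softmax(scores * 4))  # ~[0.98, 0.02, 0.00, 0.00]: sharper scores squash the rest
```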
Multi-head Attentions

Sparse
