MHA
An important implementation detail that improves training.
How
In theory, the total computational cost is about the same as a single full-dimension attention, but the independence between heads greatly increases expressiveness and stability, and the smaller per-head matrices are easy to parallelize, so computation is fast. The input vector is split into multiple heads, each performing attention independently, and the results are then combined. Because the vector is segmented into heads rather than processed as a single vector, an output projection (a linear transformation) is applied after the heads are concatenated.
In other words, the query, key, and value projections are split into independent projections, one per head, so each head learns relationships in a different subspace. While a single attention map can only learn one relationship pattern, multi-head attention can learn diverse aspects such as syntax, semantics, position, and co-reference.
In summary, MHA provides a superior inductive bias in terms of stability and generalization.
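A minimal PyTorch sketch of this split-attend-concat-project flow, assuming self-attention over a single input; the class name, dimensions, and fused per-head projections are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Per-head projections, implemented as one fused linear layer each
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)  # output projection after concatenation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        # Project, then split the model dimension into (n_heads, d_head)
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Each head attends independently in its own subspace
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (B, heads, T, T)
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                          # (B, heads, T, d_head)
        # Concatenate the heads, then mix them with the output projection
        out = out.transpose(1, 2).contiguous().view(B, T, -1)
        return self.out_proj(out)

x = torch.randn(2, 16, 512)                 # (batch, sequence, d_model)
mha = MultiHeadAttention(d_model=512, n_heads=8)
print(mha(x).shape)                         # torch.Size([2, 16, 512])
```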
Result
Each head learns to look at different semantics simultaneously. Because attention uses the softmax function, the highest score is amplified while the lower ones are squashed, so each head tends to focus on a single element. Multiple heads let the model attend to several words at once. They also provide redundancy: if any single head fails, the other attention heads can compensate.
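A small illustration of this softmax concentration effect, using made-up scores:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw attention scores for one query over four tokens
scores = torch.tensor([2.0, 1.0, 0.5, 0.1])
print(F.softmax(scores, dim=-1))      # ~[0.57, 0.21, 0.13, 0.09]: the top token already dominates
print(F.softmax(scores * 4, dim=-1))  # sharper scores push the distribution toward one-hot
```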

Multi-head Attentions

Seonglae Cho

