MHA
An important implementation detail that improves training.
How
The input vector is split into multiple heads, each performing attention independently in its own subspace, and the results are concatenated and passed through an output projection (a linear transformation). In theory the total computational cost is the same as one full-width attention, and splitting the projections is mathematically equivalent while being easier to parallelize, so computation is faster; the independence of the heads also greatly increases expressiveness and stability.
In other words, the model splits into as many independent projections as there are heads, and each head learns relationships in a different subspace. While a single attention map can only learn one relationship pattern, multiple heads can learn diverse aspects in parallel, such as syntax, semantics, position, and co-reference.
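A minimal NumPy sketch of this split-attend-concat-project flow (not the fairseq implementation linked below); the dimension sizes, weight matrices, and function names are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) projections."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project once, then split the result into n_heads subspaces
    q = (x @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)  # (h, s, d_head)
    k = (x @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Each head performs scaled dot-product attention independently
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, s, s)
    heads = softmax(scores, axis=-1) @ v                  # (h, s, d_head)

    # Concatenate the heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Toy usage with random weights (illustrative only)
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 8, 2, 4
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads).shape)  # (4, 8)
```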
In summary, MHA provides a superior inductive bias in terms of stability and generalization.
Result
Each head learns to look at different semantics simultaneously. Because attention uses the softmax function, which amplifies the highest value while squashing the lower ones, each head tends to focus on a single element; multiple heads let the model attend to several words at once. They also provide redundancy: if any single head fails, the other attention heads can be relied on.
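A tiny numeric illustration of this softmax squashing; the score values are made up:

```python
import numpy as np

scores = np.array([2.5, 1.0, 0.5, 0.1])          # hypothetical attention logits for one query
weights = np.exp(scores) / np.exp(scores).sum()  # softmax
print(weights.round(3))  # ~[0.69, 0.154, 0.093, 0.063] -> one element dominates
```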

Multi-head Attention
On GQA (Grouped Query Attention), an inference speed-up technique applied to LLaMA 2
DEVOCEAN (SK) tech blog
https://devocean.sk.com/blog/techBoardDetail.do?ID=165192&boardType=techBlog

Sparse
github.com
https://github.com/facebookresearch/fairseq/blob/main/fairseq/modules/sparse_multihead_attention.py
Some Intuition on Attention and the Transformer
What's the big deal, intuition on query-key-value vectors, multiple heads, multiple layers, and more.
https://eugeneyan.com/writing/attention/


Seonglae Cho