Multi-head Attention

Creator: Alan Jo
Created: 2023 Mar 5 23:52
Editor: Alan Jo
Edited: 2024 Apr 14 9:32

MHA

Each head learns to attend to different semantics simultaneously.

A very important implementation detail that makes training work better than you might expect.
Because attention uses the softmax function, which amplifies the highest value while squashing the lower ones, each head tends to focus on a single element.
Multiple heads let us attend to several words at once. They also provide redundancy: if any single head fails, the other attention heads can compensate.
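As a quick numerical illustration of how softmax concentrates weight on the top score (the values here are arbitrary, not from the note):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([3.0, 1.0, 0.2])
print(softmax(scores))  # ~[0.84, 0.11, 0.05]: the highest score dominates
```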
Multi-head attention works by splitting the input vector into several heads, performing attention within each head independently, and then combining the results.
Because applying multiple heads yields segmented per-head vectors rather than a single vector, the concatenated output usually passes through an output projection (a linear transformation), as in the sketch below.
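A minimal NumPy sketch of that split-attend-concatenate-project flow; the function name, shapes, and weight matrices (Wq, Wk, Wv, Wo) are illustrative assumptions, and d_model is assumed divisible by num_heads:

```python
import numpy as np

def multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo):
    """Minimal multi-head self-attention sketch.
    x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    # Project, then split each projected vector into num_heads smaller heads
    def split(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)

    # Scaled dot-product attention performed independently in each head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)       # (heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                  # row-wise softmax
    heads = weights @ v                                         # (heads, seq, d_head)

    # Concatenate heads back into one vector, then apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Tiny usage example with random weights (shapes only, not a trained model)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                        # 5 tokens, d_model = 8
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(x, num_heads=2, Wq=Wq, Wk=Wk, Wv=Wv, Wo=Wo)
print(out.shape)                                    # (5, 8)
```

In practice, frameworks provide this directly, e.g. PyTorch's torch.nn.MultiheadAttention.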
Multi-head Attentions

Sparse

