Gated Attention

Creator: Seonglae Cho
Created: 2024 Apr 28 8:55
Edited: 2025 Dec 5 11:25
Gated Attentions
Applying a head-specific sigmoid gate (the G1 position) immediately after the SDPA (Scaled Dot-Product Attention) output → improves performance (PPL↓, MMLU↑), training stability↑, and allows higher learning rates.
After the attention output, the sigmoid gate pushes unnecessary information toward 0, so heads no longer need to concentrate their attention on the same token (e.g., the first token). It works through two effects: added nonlinearity → resolving the expressiveness limit of the otherwise purely linear path from the Value projection (Wv) to the Output projection (Wo), and query-dependent sparsity → filtering out unneeded information → eliminating massive activations and the Attention Sink.
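A minimal PyTorch sketch of this G1-style gating, assuming the gate is an elementwise sigmoid computed per head from the same hidden state that produces the query; module and parameter names here are illustrative, not the paper's code:

```python
# Sketch: multi-head attention with a head-specific sigmoid gate applied
# elementwise to the SDPA output (the G1 position), before W_o.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)
        # Query-dependent, head-specific gate computed from the same
        # hidden state as Q (assumed configuration for this sketch).
        self.w_gate = nn.Linear(d_model, d_model, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        shape = (B, T, self.n_heads, self.d_head)
        q = self.wq(x).view(shape).transpose(1, 2)  # (B, H, T, d_head)
        k = self.wk(x).view(shape).transpose(1, 2)
        v = self.wv(x).view(shape).transpose(1, 2)

        # Standard causal SDPA
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # G1: sigmoid gate on the SDPA output, per head and per query
        # position. Gate values near 0 let a head discard its output for
        # that query instead of parking attention on a sink token.
        gate = torch.sigmoid(self.w_gate(x)).view(shape).transpose(1, 2)
        attn = attn * gate

        out = attn.transpose(1, 2).reshape(B, T, D)
        return self.wo(out)
```

The gate also breaks the otherwise linear Wv→Wo path with a nonlinearity, which is the expressiveness argument above.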