Gated Attention
Applying a head-specific sigmoid gate (G1) immediately after the SDPA (Scaled Dot-Product Attention) output → improves performance (PPL↓, MMLU↑), improves training stability, and allows higher learning rates.
Applied after the attention output, the sigmoid gate pushes unnecessary information toward 0, so attention no longer needs to concentrate its probability mass on a single token (e.g., the first token). Two mechanisms are at work: (1) nonlinearity → the gate breaks the purely linear path between Value (Wv) and Output (Wo), resolving its expressiveness limitation; (2) query-dependent sparsity → irrelevant information is filtered out per query, eliminating massive activations and the Attention Sink.
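A minimal single-head sketch of this G1 placement, assuming an elementwise, query-dependent sigmoid gate computed from the input with a hypothetical per-head weight matrix `Wg` (names and shapes are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_sdpa(X, Wq, Wk, Wv, Wg):
    """Single attention head with a sigmoid gate applied right after SDPA.

    X: (seq, d_model); Wq/Wk/Wv/Wg: (d_model, d_head).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
    sdpa_out = attn @ V                        # standard SDPA output, (seq, d_head)
    gate = 1.0 / (1.0 + np.exp(-(X @ Wg)))    # query-dependent sigmoid gate in (0, 1)
    return gate * sdpa_out                     # G1: elementwise gating before Wo
```

Because the gate is in (0, 1) and depends on each query position's input, it can drive unneeded channels toward 0 instead of forcing attention weights onto a sink token.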
openreview.net: https://openreview.net/pdf?id=1b7whO4SfY
arxiv.org: https://arxiv.org/pdf/2209.10655

Seonglae Cho