Gated Attention
Applying a head-specific sigmoid gate (G1) immediately after the SDPA (Scaled Dot-Product Attention) output → improves performance (PPL↓, MMLU↑), improves training stability, and allows higher learning rates.
Applied after the attention output, the sigmoid gate pushes unnecessary information toward 0, so attention no longer needs to concentrate its probability mass on a single token (e.g., the first token). Two mechanisms are at work: (1) nonlinearity → the gate breaks the purely linear path between Value (Wv) and Output (Wo), resolving its expressiveness limitation; (2) query-dependent sparsity → irrelevant information is filtered out per query, eliminating massive activations and the Attention Sink.
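A minimal single-head sketch of this G1 placement, assuming an elementwise, query-dependent sigmoid gate computed from the input with a hypothetical per-head weight matrix `Wg` (names and shapes are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_sdpa(X, Wq, Wk, Wv, Wg):
    """Single attention head with a sigmoid gate applied right after SDPA.

    X: (seq, d_model); Wq/Wk/Wv/Wg: (d_model, d_head).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
    sdpa_out = attn @ V                        # standard SDPA output, (seq, d_head)
    gate = 1.0 / (1.0 + np.exp(-(X @ Wg)))    # query-dependent sigmoid gate in (0, 1)
    return gate * sdpa_out                     # G1: elementwise gating before Wo
```

Because the gate is in (0, 1) and depends on each query position's input, it can drive unneeded channels toward 0 instead of forcing attention weights onto a sink token.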
openreview.net: https://openreview.net/pdf?id=1b7whO4SfY
arxiv.org: https://arxiv.org/pdf/2209.10655

Seonglae Cho