Massive Activation
Massive activations typically appear at tokens with weak semantics: special tokens, delimiters, conjunctions, prepositions, the first token, and number tokens.
A massive activation functions like a fixed constant bias inside the model: zeroing it out degrades performance, while fixing it to its mean value leaves performance intact. During LayerNorm/RMSNorm, the huge value dominates the variance, so that token's normalized vector becomes distinctive, propagates through Q/K/V, and attracts attention. Adding explicit attention-bias parameters (k′, v′) during training eliminates the massive-activation phenomenon while preserving model performance (see the sketch after the link below).
https://arxiv.org/pdf/2402.17762
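A minimal sketch of that attention-bias mitigation, assuming a single head and omitting the causal mask for brevity; the class and parameter names are mine, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithKVBias(nn.Module):
    """Single-head attention with learnable key/value bias vectors (k', v').

    Sketch of the explicit attention-bias idea (arXiv:2402.17762): appending
    one learned key/value slot gives each head a place to park surplus
    attention, so the model no longer needs to manufacture massive
    activations to build that bias itself.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # Learnable bias slot: one extra (key, value) pair shared across positions.
        self.k_bias = nn.Parameter(torch.zeros(1, 1, d_model))
        self.v_bias = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); causal mask omitted for brevity
        b = x.size(0)
        q = self.q_proj(x)
        # Prepend the bias slot so every query can attend to it.
        k = torch.cat([self.k_bias.expand(b, -1, -1), self.k_proj(x)], dim=1)
        v = torch.cat([self.v_bias.expand(b, -1, -1), self.v_proj(x)], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v

x = torch.randn(2, 8, 64)
print(AttentionWithKVBias(64)(x).shape)  # torch.Size([2, 8, 64])
```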
Because softmax must allocate all probability mass, a head whose logits are mostly similar concentrates its attention on a default location, typically the Beginning-of-Sequence (BOS) token, which is visible from every position.
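To make the forced-allocation point concrete, a toy illustration (all logit values are made up): softmax rows must sum to 1, so a head that finds nothing relevant still has to put its mass somewhere, and a slightly elevated BOS logit becomes the default dump slot.

```python
import torch

# Toy logits (made-up values). Position 0 is BOS with a mildly elevated logit.
flat  = torch.tensor([2.0, 0.1, 0.0, 0.1, 0.0])  # no token is strongly relevant
match = torch.tensor([2.0, 0.1, 6.0, 0.1, 0.0])  # position 2 is strongly relevant

print(torch.softmax(flat,  dim=-1))  # ~0.64 on BOS: the sink absorbs the mass
print(torch.softmax(match, dim=-1))  # ~0.02 on BOS: mass goes to the real match
```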
The attention sink is an optimization fixed point for some heads, reinforced by a combination of data, positional, and model-path biases. Practical mitigations include per-head logit centering/scaling, sparsity-inducing score functions (softpick), and pruning/gating; see the sketch below.
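As one concrete instance, here is my sketch of the softpick-style rectified softmax, per my reading of arXiv:2504.20966 (the paper's numerical-stabilization details may differ): zero and negative logits get exactly zero weight, so rows may sum to less than 1 and a head can abstain instead of dumping mass on a sink.

```python
import torch

def softpick(logits: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Rectified softmax: ReLU(e^x - 1) in the numerator means logits <= 0
    # get exactly zero weight; |e^x - 1| in the denominator still normalizes,
    # so a row can sum to less than 1 (the head may "abstain").
    e = torch.exp(logits) - 1.0
    return torch.relu(e) / (e.abs().sum(dim=-1, keepdim=True) + eps)

logits = torch.tensor([-0.5, 0.1, 0.0, 0.1, -0.2])  # made-up scores
w = softpick(logits)
print(w, w.sum())  # zeros where logits <= 0; row sum < 1, no forced sink
```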
Previous research on KV bias / attention bias (adding learned vectors to Q/K/V) has been summarized as not consistently effective at mitigating massive activations and attention sinks.
https://arxiv.org/pdf/2504.20966
When Attention Sink Emerges in Language Models (ICLR 2025 spotlight)
Softmax normalization couples the attention scores of all tokens (they must sum to 1); the sink token then acts like a bias on the key side, absorbing surplus attention while contributing almost nothing through its value vector (see the toy check below).
https://arxiv.org/pdf/2410.10781
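A toy check of that reading (all numbers made up): if the sink token's value vector is ~0, attending to it only rescales the rest of the output by (1 − a_sink), i.e. it stores surplus attention without adding content.

```python
import torch

# Position 0 is the sink: it receives most of the attention mass...
torch.manual_seed(0)
a = torch.softmax(torch.tensor([3.0, 0.5, 0.2, 0.1]), dim=-1)  # a[0] ~ 0.84
v = torch.randn(4, 8)
v[0] = 0.0  # ...but its value vector contributes (almost) nothing

out  = a @ v                                   # attention output with the sink
rest = (a[1:] / a[1:].sum()) @ v[1:]           # renormalized output without it
print(torch.allclose(out, (1 - a[0]) * rest))  # True: sink only rescales
```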

Seonglae Cho