Massive Activation
Massive activations typically appear at tokens with weak semantics: special tokens, delimiters, conjunctions, prepositions, the first token, and number tokens.
A massive activation functions like a fixed constant bias inside the model: zeroing it out degrades performance, while fixing it to its mean value leaves performance intact. During LayerNorm/RMSNorm, the huge value dominates the variance, so that token's normalized vector becomes distinctive, propagates through Q/K/V, and attracts attention. Adding explicit attention-bias parameters (k′, v′) during training eliminates the massive-activation phenomenon while preserving model performance (see the sketch after the link below).
https://arxiv.org/pdf/2402.17762
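A minimal sketch of that attention-bias mitigation, assuming a single head and omitting the causal mask for brevity; the class and parameter names are mine, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithKVBias(nn.Module):
    """Single-head attention with learnable key/value bias vectors (k', v').

    Sketch of the explicit attention-bias idea (arXiv:2402.17762): appending
    one learned key/value slot gives each head a place to park surplus
    attention, so the model no longer needs to manufacture massive
    activations to build that bias itself.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # Learnable bias slot: one extra (key, value) pair shared across positions.
        self.k_bias = nn.Parameter(torch.zeros(1, 1, d_model))
        self.v_bias = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); causal mask omitted for brevity
        b = x.size(0)
        q = self.q_proj(x)
        # Prepend the bias slot so every query can attend to it.
        k = torch.cat([self.k_bias.expand(b, -1, -1), self.k_proj(x)], dim=1)
        v = torch.cat([self.v_bias.expand(b, -1, -1), self.v_proj(x)], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v

x = torch.randn(2, 8, 64)
print(AttentionWithKVBias(64)(x).shape)  # torch.Size([2, 8, 64])
```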
Because softmax must allocate all probability mass, a head whose logits are mostly similar concentrates its attention on a default location, typically the Beginning-of-Sequence (BOS) token, which is visible from every position.
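To make the forced-allocation point concrete, a toy illustration (all logit values are made up): softmax rows must sum to 1, so a head that finds nothing relevant still has to put its mass somewhere, and a slightly elevated BOS logit becomes the default dump slot.

```python
import torch

# Toy logits (made-up values). Position 0 is BOS with a mildly elevated logit.
flat  = torch.tensor([2.0, 0.1, 0.0, 0.1, 0.0])  # no token is strongly relevant
match = torch.tensor([2.0, 0.1, 6.0, 0.1, 0.0])  # position 2 is strongly relevant

print(torch.softmax(flat,  dim=-1))  # ~0.64 on BOS: the sink absorbs the mass
print(torch.softmax(match, dim=-1))  # ~0.02 on BOS: mass goes to the real match
```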
The attention sink is an optimization fixed point for some heads, reinforced by a combination of data, positional, and model-path biases. Practical mitigations include per-head logit centering/scaling, sparsity-inducing score functions (softpick), and pruning/gating; see the sketch below.
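As one concrete instance, here is my sketch of the softpick-style rectified softmax, per my reading of arXiv:2504.20966 (the paper's numerical-stabilization details may differ): zero and negative logits get exactly zero weight, so rows may sum to less than 1 and a head can abstain instead of dumping mass on a sink.

```python
import torch

def softpick(logits: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Rectified softmax: ReLU(e^x - 1) in the numerator means logits <= 0
    # get exactly zero weight; |e^x - 1| in the denominator still normalizes,
    # so a row can sum to less than 1 (the head may "abstain").
    e = torch.exp(logits) - 1.0
    return torch.relu(e) / (e.abs().sum(dim=-1, keepdim=True) + eps)

logits = torch.tensor([-0.5, 0.1, 0.0, 0.1, -0.2])  # made-up scores
w = softpick(logits)
print(w, w.sum())  # zeros where logits <= 0; row sum < 1, no forced sink
```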
Previous research on KV bias / attention bias (adding learned vectors to Q/K/V) has been summarized as not consistently effective at mitigating massive activations and attention sinks.
https://arxiv.org/pdf/2504.20966
When Attention Sink Emerges in Language Models (ICLR 2025 spotlight)
Softmax normalization couples the attention scores of all tokens (they must sum to 1); the sink token then acts like a bias on the key side, absorbing surplus attention while contributing almost nothing through its value vector (see the toy check below).
https://arxiv.org/pdf/2410.10781
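A toy check of that reading (all numbers made up): if the sink token's value vector is ~0, attending to it only rescales the rest of the output by (1 − a_sink), i.e. it stores surplus attention without adding content.

```python
import torch

# Position 0 is the sink: it receives most of the attention mass...
torch.manual_seed(0)
a = torch.softmax(torch.tensor([3.0, 0.5, 0.2, 0.1]), dim=-1)  # a[0] ~ 0.84
v = torch.randn(4, 8)
v[0] = 0.0  # ...but its value vector contributes (almost) nothing

out  = a @ v                                   # attention output with the sink
rest = (a[1:] / a[1:].sum()) @ v[1:]           # renormalized output without it
print(torch.allclose(out, (1 - a[0]) * rest))  # True: sink only rescales
```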

Seonglae Cho