Massive Activation
Massive activations tend to appear on tokens with weak semantics: special tokens, delimiters, conjunctions, prepositions, the first token, and number tokens.
These massive activations function like a constant bias inside the model: setting them to zero degrades performance, while fixing them to their mean value preserves it. During LayerNorm/RMSNorm, the large values dominate the variance, so the token's normalized vector becomes highly distinctive, propagates through the Q/K/V projections, and draws attention to that token. Adding explicit attention-bias parameters (k′, v′) during training eliminates the massive-activation phenomenon while preserving model performance.
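As an illustration of the (k′, v′) idea, here is a minimal single-head sketch that appends one learnable key/value pair as an always-visible virtual position; the class and parameter names are hypothetical and the exact formulation in the original work may differ.

```python
import torch
import torch.nn.functional as F

class AttentionWithKVBias(torch.nn.Module):
    """Single-head attention with explicit learnable key/value bias vectors.

    The extra (k', v') pair gives softmax a dedicated slot for "no-op"
    attention mass, so the model no longer needs to manufacture massive
    activations on a regular token to serve that role.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_proj = torch.nn.Linear(d_model, d_model, bias=False)
        self.v_proj = torch.nn.Linear(d_model, d_model, bias=False)
        # Learnable bias key/value, shared by all positions.
        self.k_bias = torch.nn.Parameter(torch.zeros(1, 1, d_model))
        self.v_bias = torch.nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Prepend the bias key/value as an extra "virtual" position.
        k = torch.cat([self.k_bias.expand(b, 1, d), k], dim=1)
        v = torch.cat([self.v_bias.expand(b, 1, d), v], dim=1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5          # (b, t, t+1)
        # Causal mask over real positions; the bias slot is always visible.
        causal = torch.tril(torch.ones(t, t, device=x.device)).bool()
        bias_col = torch.ones(t, 1, dtype=torch.bool, device=x.device)
        mask = torch.cat([bias_col, causal], dim=1)
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

out = AttentionWithKVBias(64)(torch.randn(2, 10, 64))  # (2, 10, 64)
```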
Because softmax must allocate a full unit of probability mass even when most logits are similar, attention tends to concentrate on a default location, typically the Beginning-of-Sequence (BOS) token, which is visible to every position under causal masking.
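A toy illustration of why the leftover mass lands on BOS: with a causal mask and nearly uniform logits, BOS is the only position every query can attend to, so it receives the largest average attention. This shows only the positional part of the story; trained models additionally learn logits that favor the sink.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 8
# Nearly identical logits: no token is strongly preferred by any query.
logits = 0.01 * torch.randn(T, T)
# Causal mask: position i may only attend to positions 0..i.
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
weights = F.softmax(logits.masked_fill(~mask, float("-inf")), dim=-1)

# Average attention received by each position across all queries.
# Position 0 (BOS) gets the largest share simply because every query
# can see it, so the mass softmax is forced to hand out piles up there.
print(weights.mean(dim=0))
```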
The attention sink is an optimization fixed point for some heads, reinforced by a combination of data, positional, and model-pathway biases. Practical mitigation strategies include per-head logit centering/scaling, sparsity-inducing alternatives to softmax (e.g., softpick), and head pruning/gating.
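A minimal sketch of the softpick-style rectified normalization mentioned above, written directly from the published formula relu(e^x − 1) / (Σ|e^x − 1| + ε); the function name and ε handling here are my own, and production implementations use a numerically stabilized variant.

```python
import torch

def softpick(scores: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    """Rectified-softmax ("softpick") style normalization.

    Unlike softmax, the outputs are not forced to sum to 1, so a query
    that finds nothing relevant (all scores <= 0) can emit zero weight
    everywhere instead of dumping its mass onto a sink token.
    """
    e = torch.exp(scores) - 1.0
    num = torch.relu(e)                               # only positive evidence attends
    den = e.abs().sum(dim=dim, keepdim=True) + eps    # magnitude-based normalizer
    return num / den

# All scores non-positive: softmax still distributes a full unit of mass,
# softpick returns exactly zero weights.
x = torch.tensor([-0.3, -1.0, -0.1, -2.0])
print(torch.softmax(x, dim=-1))  # sums to 1
print(softpick(x))               # all zeros
```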
Earlier work on KV biases / attention biases (adding learned vectors to the keys and values) is summarized as not consistently effective at mitigating massive activations and attention sinks.
When the attention sink emerges (ICLR 2025 spotlight)
Softmax normalization creates a dependence among tokens' attention weights; the sink token ends up acting like a bias on the key side while contributing very little meaningful value to the output.
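The dependence introduced by softmax can be seen in a few lines: because the weights must sum to one, boosting the sink's logit suppresses every other token, whereas an unnormalized element-wise sigmoid, one of the alternatives studied in this line of work, shows no such coupling. The snippet below is purely illustrative.

```python
import torch

# Softmax couples every position's weight to every other position's logit:
# raising one logit necessarily lowers all the other weights.
logits = torch.tensor([1.0, 0.5, 0.2, -0.3])
bumped = logits.clone()
bumped[0] += 2.0  # strengthen one key (e.g. the sink)

print(torch.softmax(logits, dim=-1))
print(torch.softmax(bumped, dim=-1))   # all other weights shrink

# Element-wise sigmoid has no sum-to-one constraint, so the same bump
# leaves the other positions' weights untouched.
print(torch.sigmoid(logits))
print(torch.sigmoid(bumped))
```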