Because the softmax forces attention weights to sum to one, a head with no clearly relevant target still has to place its probability mass somewhere; when most logits are similar, this excess mass typically lands on the Beginning-of-Sequence (BOS) token, which under causal masking is attendable from every position.
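A minimal numerical sketch (NumPy) of this sum-to-one effect; the logit values are illustrative, not taken from any real model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Case 1: all logits similar -> mass is spread thinly but still sums to 1.
flat = softmax(np.array([0.10, 0.00, 0.05, -0.02, 0.03]))
print(flat, flat.sum())   # roughly uniform, total exactly 1

# Case 2: a moderately larger logit on position 0 (the BOS slot) soaks up
# most of the mass, acting as a sink for attention the head does not need.
sink = softmax(np.array([4.00, 0.00, 0.05, -0.02, 0.03]))
print(sink, sink.sum())   # ~0.93 on position 0
```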
The attention sink appears to be an optimization fixed point for some heads, reinforced by a combination of data, positional, and model-path biases. Practical mitigation strategies include per-head logit centering/scaling, sparsity-inducing softmax variants (e.g., softpick), and pruning/gating; a sketch of the first idea follows.
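A minimal PyTorch sketch of per-head logit centering/scaling as a sink mitigation: each head's scores are re-centered over the valid keys before the softmax, with an optional learned per-head gain. The module name and the `gain` parameter are illustrative assumptions, not a reference implementation from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenteredAttentionScores(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        # one learnable scale per head (initialized to 1.0 = no change)
        self.gain = nn.Parameter(torch.ones(num_heads))

    def forward(self, scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, q_len, k_len); mask: bool, True = attendable key,
        # broadcastable to the same shape.
        valid = mask.to(scores.dtype)
        # mean of each head's logits over valid keys only
        mean = (scores * valid).sum(-1, keepdim=True) / valid.sum(-1, keepdim=True).clamp(min=1.0)
        # center (and rescale per head) so near-uniform logits stay near zero
        centered = (scores - mean) * self.gain.view(1, -1, 1, 1)
        centered = centered.masked_fill(~mask, float("-inf"))
        return F.softmax(centered, dim=-1)
```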
Previous research on KV-bias / attention bias (adding learnable bias vectors to Q/K/V) has been summarized as not consistently effective at mitigating massive activations and attention sinks.
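For reference, a hedged sketch of one common realization of the KV-bias idea: a learnable key/value pair is prepended as an extra attended slot, giving heads somewhere to park unused attention without touching real tokens. The class name, shapes, and zero initialization are illustrative assumptions (causal masking omitted for brevity); as noted above, the summarized finding is that this family of fixes does not reliably remove sinks or massive activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVBiasAttention(nn.Module):
    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        # learnable "virtual" key/value slot shared across the batch
        self.k_bias = nn.Parameter(torch.zeros(1, num_heads, 1, head_dim))
        self.v_bias = nn.Parameter(torch.zeros(1, num_heads, 1, head_dim))

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, heads, seq, head_dim)
        b = q.size(0)
        k = torch.cat([self.k_bias.expand(b, -1, -1, -1), k], dim=2)
        v = torch.cat([self.v_bias.expand(b, -1, -1, -1), v], dim=2)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)   # bias slot can absorb excess mass
        return attn @ v
```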