Because the softmax forces attention weights to sum to one, a head with no clearly relevant target still has to place its probability mass somewhere; when most logits are similar, this excess mass typically lands on the Beginning-of-Sequence (BOS) token, which under causal masking is attendable from every position.
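A minimal numerical sketch (NumPy) of this sum-to-one effect; the logit values are illustrative, not taken from any real model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Case 1: all logits similar -> mass is spread thinly but still sums to 1.
flat = softmax(np.array([0.10, 0.00, 0.05, -0.02, 0.03]))
print(flat, flat.sum())   # roughly uniform, total exactly 1

# Case 2: a moderately larger logit on position 0 (the BOS slot) soaks up
# most of the mass, acting as a sink for attention the head does not need.
sink = softmax(np.array([4.00, 0.00, 0.05, -0.02, 0.03]))
print(sink, sink.sum())   # ~0.93 on position 0
```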
The attention sink appears to be an optimization fixed point for some heads, reinforced by a combination of data, positional, and model-path biases. Practical mitigation strategies include per-head logit centering/scaling, sparsity-inducing softmax variants (e.g., softpick), and pruning/gating; a sketch of the first idea follows.
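A minimal PyTorch sketch of per-head logit centering/scaling as a sink mitigation: each head's scores are re-centered over the valid keys before the softmax, with an optional learned per-head gain. The module name and the `gain` parameter are illustrative assumptions, not a reference implementation from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenteredAttentionScores(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        # one learnable scale per head (initialized to 1.0 = no change)
        self.gain = nn.Parameter(torch.ones(num_heads))

    def forward(self, scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, q_len, k_len); mask: bool, True = attendable key,
        # broadcastable to the same shape.
        valid = mask.to(scores.dtype)
        # mean of each head's logits over valid keys only
        mean = (scores * valid).sum(-1, keepdim=True) / valid.sum(-1, keepdim=True).clamp(min=1.0)
        # center (and rescale per head) so near-uniform logits stay near zero
        centered = (scores - mean) * self.gain.view(1, -1, 1, 1)
        centered = centered.masked_fill(~mask, float("-inf"))
        return F.softmax(centered, dim=-1)
```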
Previous research on KV-bias / attention bias (adding learnable bias vectors to Q/K/V) has been summarized as not consistently effective at mitigating massive activations and attention sinks.
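For reference, a hedged sketch of one common realization of the KV-bias idea: a learnable key/value pair is prepended as an extra attended slot, giving heads somewhere to park unused attention without touching real tokens. The class name, shapes, and zero initialization are illustrative assumptions (causal masking omitted for brevity); as noted above, the summarized finding is that this family of fixes does not reliably remove sinks or massive activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVBiasAttention(nn.Module):
    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        # learnable "virtual" key/value slot shared across the batch
        self.k_bias = nn.Parameter(torch.zeros(1, num_heads, 1, head_dim))
        self.v_bias = nn.Parameter(torch.zeros(1, num_heads, 1, head_dim))

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, heads, seq, head_dim)
        b = q.size(0)
        k = torch.cat([self.k_bias.expand(b, -1, -1, -1), k], dim=2)
        v = torch.cat([self.v_bias.expand(b, -1, -1, -1), v], dim=2)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)   # bias slot can absorb excess mass
        return attn @ v
```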