Softpick function

Creator: Seonglae Cho
Created: 2025 Aug 1 16:53
Edited: 2025 Aug 16 1:48
The Softmax Function tends to concentrate attention on specific positions when most logits are similar, typically on the Beginning-of-Sequence (BOS) token, which is attended to from every position.
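As a toy illustration (made-up logit values, NumPy only), the normalization itself forces the sink: softmax always sums to 1, so a head with nothing useful to attend to still has to place its mass somewhere, and a slightly elevated BOS logit then absorbs almost all of it.

```python
import numpy as np

def softmax(x):
    # Standard softmax, shifted by the max for numerical stability.
    z = np.exp(x - np.max(x))
    return z / z.sum()

# A head with "nothing to attend to": all logits low and similar.
logits = np.array([-6.0, -6.2, -5.9, -6.1])
print(softmax(logits))        # roughly uniform, ~[0.26 0.21 0.29 0.24]
print(softmax(logits).sum())  # always 1.0 -- the mass must go somewhere

# In trained models that "somewhere" is typically the BOS position:
# a modestly higher logit at position 0 absorbs nearly all the mass.
logits_with_sink = np.array([2.0, -6.2, -5.9, -6.1])
print(softmax(logits_with_sink))  # ~[0.999, ...] concentrated on the BOS slot
```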
The attention sink is an optimization fixed point for some heads and is reinforced by a combination of data, position, and model-path biases. Practical mitigation strategies include per-head logit centering/scaling, sparsity-inducing softmax replacements such as softpick, and pruning/gating.
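A minimal sketch of the softpick idea, assuming the rectified-softmax formulation ReLU(e^x − 1) / (Σ|e^x − 1| + ε); the exact numerically stable form and ε used in the original paper may differ.

```python
import numpy as np

def softpick(x, eps=1e-8):
    # Rectified softmax: non-positive logits contribute zero to the numerator,
    # so the output can sum to less than 1 (even to exactly 0).
    m = np.max(x)
    num = np.maximum(np.exp(x - m) - np.exp(-m), 0.0)        # relu(e^x - 1), max-shifted
    den = np.abs(np.exp(x - m) - np.exp(-m)).sum() + eps
    return num / den

logits = np.array([-6.0, -6.2, -5.9, -6.1])   # a head with nothing to attend to
print(softpick(logits))        # all zeros: no forced attention sink
print(softpick(logits).sum())  # 0.0, unlike softmax which always sums to 1

logits = np.array([3.0, 0.5, -2.0, 1.0])
print(softpick(logits))        # positive logits share the mass, negatives get 0
```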
Previous research on KV-bias / attention-bias approaches (adding learnable vectors to Q/K/V) has been summarized as not consistently effective at mitigating massive activations and attention sinks.
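For concreteness, one hedged sketch of what "adding vectors to Q/K/V" typically looks like: a learnable key/value pair prepended to every attention call so heads can park unwanted mass there instead of on a real token. The function and parameter names here are illustrative, not taken from a specific paper.

```python
import numpy as np

def attention_with_kv_bias(q, k, v, k_bias, v_bias):
    """Single-head attention with one learnable 'sink' key/value pair
    prepended to K and V, giving heads a dedicated place to dump
    attention mass instead of a real token such as BOS."""
    k = np.concatenate([k_bias[None, :], k], axis=0)   # (1+T, d)
    v = np.concatenate([v_bias[None, :], v], axis=0)   # (1+T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (T_q, 1+T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # ordinary softmax
    return weights @ v                                  # (T_q, d)

T, d = 4, 8
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, T, d))
k_bias, v_bias = np.zeros(d), np.zeros(d)   # would be learned in practice
print(attention_with_kv_bias(q, k, v, k_bias, v_bias).shape)  # (4, 8)
```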
