When attention sink emerges (ICLR 2025 spotlight)
During Softmax normalization, internal dependencies are created between tokens, and sink tokens end up acting like a bias on the Key side: they attract attention mass while contributing very little to the actual value outputs.
Attention sinks are a mechanism by which LLMs avoid over-mixing as information propagates through the Transformer's layers.
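
A minimal NumPy sketch of this key-bias picture (all tensors, logits, and the sink position are illustrative assumptions, not taken from any real model): the sink position receives a large logit from every query, so it soaks up most of the softmax mass, while its near-zero value vector means it adds almost nothing to the mixed output.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head causal attention (illustrative numbers only).
rng = np.random.default_rng(0)
T, d = 6, 16
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

# Position 0 plays the sink: its score enters every softmax like a constant
# bias (same large logit for every query), while its value is nearly zero,
# so it contributes almost nothing to the output.
sink_logit = 4.0
V[0] = 1e-3 * rng.normal(size=d)

scores = Q @ K.T / np.sqrt(d)
scores[:, 0] = sink_logit  # key-side bias: fixed large logit toward the sink

causal = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(causal, scores, -np.inf)

attn = softmax(scores, axis=-1)
out = attn @ V

print("attention mass absorbed by the sink:", attn[:, 0].round(2))
print("mass left for mixing content tokens:", (1 - attn[:, 0]).round(2))
# Because each softmax row must sum to 1, the sink absorbs most of the mass,
# keeping the weights on content tokens small (limiting over-mixing), while
# its near-zero value leaves the output essentially unchanged.
```

The design point the sketch illustrates: the sink does not carry information itself; it only dilutes the attention distribution, which is how it can act as a key-side bias without contributing meaningful values.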

Seonglae Cho