When attention sink emerges (ICLR 2025 spotlight)
During Softmax normalization, internal dependencies are created between tokens, and sink tokens end up acting like a bias on the Key side: they attract attention mass while contributing very little to the actual value outputs.
Attention sinks are a mechanism by which LLMs avoid over-mixing as information propagates through the Transformer's layers.
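
A minimal NumPy sketch of this key-bias picture (all tensors, logits, and the sink position are illustrative assumptions, not taken from any real model): the sink position receives a large logit from every query, so it soaks up most of the softmax mass, while its near-zero value vector means it adds almost nothing to the mixed output.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head causal attention (illustrative numbers only).
rng = np.random.default_rng(0)
T, d = 6, 16
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

# Position 0 plays the sink: its score enters every softmax like a constant
# bias (same large logit for every query), while its value is nearly zero,
# so it contributes almost nothing to the output.
sink_logit = 4.0
V[0] = 1e-3 * rng.normal(size=d)

scores = Q @ K.T / np.sqrt(d)
scores[:, 0] = sink_logit  # key-side bias: fixed large logit toward the sink

causal = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(causal, scores, -np.inf)

attn = softmax(scores, axis=-1)
out = attn @ V

print("attention mass absorbed by the sink:", attn[:, 0].round(2))
print("mass left for mixing content tokens:", (1 - attn[:, 0]).round(2))
# Because each softmax row must sum to 1, the sink absorbs most of the mass,
# keeping the weights on content tokens small (limiting over-mixing), while
# its near-zero value leaves the output essentially unchanged.
```

The design point the sketch illustrates: the sink does not carry information itself; it only dilutes the attention distribution, which is how it can act as a key-side bias without contributing meaningful values.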

Seonglae Cho