Attention Sink

Creator
Seonglae Cho
Created
2023 Oct 14 12:48
Edited
2025 Nov 1 13:52

Anchor

In attention heatmaps, a strong vertical line points to the System Prompt or BOS Token, which often receives more than 50% of the Attention Score.

LLMs aggregate information through Pyramidal Information Funneling: attention scatters widely in the lower layers and ultimately focuses on a few critical tokens in the upper layers.

SINK TOKEN

Even though keeping the first token might seem semantically meaningless, it is significant. Due to the characteristics of the Attention Mechanism, the first token serves as an anchor when calculating the Attention Score through positional embedding: under causal masking it is visible to every query, so heads with nothing relevant to attend to deposit their excess attention mass there. Therefore, even if the token is semantically meaningless, the model structurally requires it.
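The sink behavior can be made concrete with a toy diagnostic: given raw causal attention logits, measure what fraction of each query's softmax mass lands on token 0. The sink boost here is a hypothetical stand-in for what a trained model learns, not real model weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def first_token_attention_share(scores):
    """Per-query fraction of attention mass on the first (sink) token.

    scores: (seq, seq) raw attention logits. A causal mask is applied,
    so query i only attends to keys 0..i.
    """
    seq = scores.shape[0]
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    attn = softmax(scores, axis=-1)
    return attn[:, 0]

rng = np.random.default_rng(0)
seq = 8
scores = rng.normal(size=(seq, seq))
scores[:, 0] += 4.0  # toy stand-in for a learned sink: boost key 0's logit
share = first_token_attention_share(scores)
print(share)  # most of each query's mass concentrates on token 0
```

Note that query 0 puts 100% of its mass on token 0 by construction (causal masking leaves it no other key), which is exactly why the first position is such a convenient dumping ground.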
Attention Sinks Notion

Attention sink

ICLR Poster: Efficient Streaming Language Models with Attention Sinks
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges.
🕳️ Attention Sinks in LLMs for endless fluency
A blog post by Tom Aarsen on Hugging Face
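The StreamingLLM recipe from the work above can be sketched as a cache-eviction policy: always retain the first few sink tokens plus a sliding window of recent tokens. This is a minimal sketch of which positions survive; real implementations also handle positional re-indexing inside the retained cache, and the parameter names here are assumptions.

```python
def streaming_kv_indices(total_len, n_sink=4, window=8):
    """Positions kept in the KV cache under a StreamingLLM-style policy:
    the first n_sink tokens (attention sinks) plus the most recent
    `window` tokens. Everything in between is evicted."""
    if total_len <= n_sink + window:
        return list(range(total_len))
    recent = range(total_len - window, total_len)
    return list(range(n_sink)) + list(recent)

print(streaming_kv_indices(20))  # sinks [0..3] plus the last 8 positions
```

The key empirical point is that evicting the sink tokens (a plain sliding window) collapses perplexity, while keeping just a handful of them restores stable streaming generation.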

Null attention

arxiv.org
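One way to frame null attention: give each head an implicit null key with logit 0, so attention weights may sum to less than 1 and a head can attend to nothing rather than dumping residual mass on a sink token. This is sometimes written as a "softmax-off-by-one"; the sketch below is illustrative and may not match the exact formulation in the linked paper.

```python
import numpy as np

def softmax1(x, axis=-1):
    """softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)).

    Equivalent to appending a null key with logit 0, so the weights
    can sum to < 1 and the head has a 'no-op' option."""
    # Stable form: the implicit null logit 0 participates in the max.
    m = np.maximum(x.max(axis=axis, keepdims=True), 0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum(axis=axis, keepdims=True))

w = softmax1(np.array([-5.0, -5.0, -5.0]))
print(w.sum())  # well below 1: the head effectively attends to nothing
```

With ordinary softmax the same logits would still be forced to sum to 1, which is precisely the pressure that creates a sink.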

Prefix use case with quantization

Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, Jaeho Lee. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.
