Attention Sink

Creator: Seonglae Cho
Created: 2023 Oct 14 12:48
Edited: 2026 Mar 27 15:44

Anchor

A strong vertical line in the Attention Score map pointing at the System Prompt or BOS Token, often receiving more than 50% of the attention mass.

LLMs aggregate information through pyramidal information funneling: attention scatters widely in lower layers and ultimately focuses on a few critical tokens in higher layers.
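As a toy illustration (not from any cited paper; all tensors here are synthetic), one can measure how much softmax attention mass lands on the first token when every query shares a direction with the first key:

```python
import numpy as np

def attention_weights(q, k):
    """Scaled dot-product attention weights for a single head."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 16, 32
u = np.ones(d) / np.sqrt(d)                      # shared "sink" direction
q = rng.normal(size=(seq_len, d)) + 3 * np.sqrt(d) * u
k = rng.normal(size=(seq_len, d))
k[0] = 3 * np.sqrt(d) * u                        # first key aligned with all queries

w = attention_weights(q, k)
sink_mass = w[:, 0].mean()                       # average attention mass on token 0
print(f"mean attention on the first token: {sink_mass:.2f}")
```

Because the first key aligns with every query, it absorbs most of the softmax mass, producing the vertical "sink" stripe in attention maps.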

SINK TOKEN

Even though keeping the first token might seem semantically meaningless, it matters: because of how the Attention Mechanism works, the first token serves as an anchor for computing Attention Scores through positional embedding. So even if it is semantically meaningless, the model structurally requires it.
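The streaming use of this observation (as in the StreamingLLM work linked below) can be sketched as a cache that permanently keeps a few initial sink tokens plus a sliding window of recent ones. This is a minimal illustrative data structure, not the paper's implementation:

```python
from collections import deque

class SinkCache:
    """Sketch of a StreamingLLM-style cache: always retain the first
    `num_sink` tokens, plus a sliding window over the most recent ones."""

    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink
        self.sink = []                      # first tokens, kept forever
        self.recent = deque(maxlen=window)  # sliding window of recent tokens

    def append(self, token):
        if len(self.sink) < self.num_sink:
            self.sink.append(token)
        else:
            self.recent.append(token)       # deque evicts the oldest itself

    def tokens(self):
        return self.sink + list(self.recent)

cache = SinkCache(num_sink=2, window=3)
for t in range(10):
    cache.append(t)
print(cache.tokens())  # sinks 0, 1 survive; the window holds 7, 8, 9
```

Evicting the sink tokens instead would remove the anchor the model structurally relies on, which is what degrades long-context generation in plain sliding-window attention.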
Attention Sinks Notion

Attention sink

ICLR Poster: Efficient Streaming Language Models with Attention Sinks
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly,...
🕳️ Attention Sinks in LLMs for endless fluency
A blog post by Tom Aarsen on Hugging Face

Null attention

arxiv.org

Prefix use case with quantization

Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, Jaeho Lee. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.
The core methodology is to mathematically analyze the mechanism that generates massive activations in pre-norm Transformers, and to run large-scale ablation experiments. When analyzing how massive activations arise in the SwiGLU feed-forward block, the SiLU activation can be approximated as $\mathrm{SiLU}(x) = x\,\sigma(x) \approx x$ for large inputs, so the feed-forward transform reduces to a quadratic form: $\mathrm{FFN}(x) \approx W_{\text{down}}\big((W_{\text{gate}} x) \odot (W_{\text{up}} x)\big)$, i.e. each output coordinate is $y_i = x^\top A_i x$ with $A_i = w_{\text{gate},i}\, w_{\text{up},i}^\top$. In this quadratic form, the eigenvalue spectrum of $A_i$ exhibits a rank-one-dominated structure, causing inputs along a particular direction to be amplified extremely.
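The rank-one-dominated amplification can be checked numerically. A minimal sketch for a single output coordinate with random weight vectors (the names and scales here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
w_gate = rng.normal(size=d)
w_up = rng.normal(size=d)

# One output coordinate of the SiLU-linearized SwiGLU block:
# y = (w_gate . x) * (w_up . x) = x^T A x, with A = outer(w_gate, w_up)
A = np.outer(w_gate, w_up)
A_sym = (A + A.T) / 2                  # the quadratic form only sees this part
eigvals, eigvecs = np.linalg.eigh(A_sym)

# The spectrum has at most two nonzero eigenvalues: rank-one dominated.
v_top = eigvecs[:, np.argmax(np.abs(eigvals))]

x_aligned = 10 * v_top                            # input along the amplified direction
x_random = 10 * rng.normal(size=d) / np.sqrt(d)   # input of similar norm, random direction
print(abs(x_aligned @ A @ x_aligned), abs(x_random @ A @ x_random))
```

Inputs along the dominant eigendirection are amplified far more than random inputs of the same norm, which is the mechanism behind the massive activation spike.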
Normalization then turns this spike token into a sparse, near-constant vector. Under RMSNorm, $\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}} \odot g$, if one dimension dominates the overall norm, the remaining dimensions shrink to nearly zero, producing a sparse, near-constant representation. This constant-like vector acts as a stable reference point for query–key dot products, leading to the formation of an attention sink.
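A minimal sketch of this sparsification effect, assuming RMSNorm without the learned gain and a synthetic outlier dimension:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: x / sqrt(mean(x_i^2))."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

rng = np.random.default_rng(0)
d = 4096
x = rng.normal(size=d)   # ordinary hidden state
x[42] = 2000.0           # one massive activation

y = rms_norm(x)
# The outlier dominates the RMS, so every other dimension collapses
# toward zero: the vector becomes sparse and near-constant.
print(y[42], np.abs(np.delete(y, 42)).max())
```

The normalized outlier lands near $\sqrt{d}$ while all other coordinates shrink to roughly $x_i / |x_{42}| \cdot \sqrt{d}$, i.e. nearly zero, matching the "sparse, near-constant vector" described above.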
The paper’s conditional gating experiment is a key result showing that the attention sink is a learned implicit gating mechanism. In the context-length ablation, enforcing a minimum context of at least 1024 causes the sink ratio to rise to 13.0%, while enforcing at least 2048 causes it to collapse to 1.2%, confirming that the sink is fundamentally a mechanism for short-range prediction. A main limitation is that the experiments focus on a single 7B Llama-style architecture, so generalization to larger scales or other architectures (e.g., Mixture-of-Experts) is not validated.
arxiv.org