Anchor
A strong vertical line in the attention-score map pointing to the system prompt or BOS token (often >50% of the attention mass)
LLMs aggregate information through Pyramidal Information Funneling, where attention scatters widely in lower layers and ultimately focuses on critical tokens in higher layers.
SINK TOKEN
Even though keeping the first token might seem semantically meaningless, it is significant. Due to the characteristics of the attention mechanism, the first token serves as an anchor for attention-score calculation through positional embedding. Therefore, even if it is semantically meaningless, the model structurally requires it.
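A toy softmax illustrates why such an anchor emerges: attention weights must sum to 1, so when a query matches no content token well, a sink token can absorb the leftover mass. All numbers below are illustrative, not from a real model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy logits: the query matches none of the "content" keys well.
content_logits = np.array([0.1, -0.2, 0.0, 0.1])

# Without a sink, softmax must still distribute ~100% of attention
# across irrelevant tokens (each attention row always sums to 1).
no_sink = softmax(content_logits)

# With a sink token (index 0) whose key yields a higher logit,
# most of the unneeded attention mass collects on the sink.
with_sink = softmax(np.concatenate(([3.0], content_logits)))

print(no_sink.round(3))
print(with_sink.round(3), "sink mass:", with_sink[0].round(3))
```

With these illustrative logits the sink token ends up with well over half of the attention mass, matching the ">50%" pattern observed on BOS tokens.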
Attention Sinks Notion
Attention sink
ICLR Poster Efficient Streaming Language Models with Attention Sinks
https://iclr.cc/virtual/2024/poster/18794
Efficient Streaming Language Models with Attention Sinks
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly,...
https://arxiv.org/abs/2309.17453
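StreamingLLM's core idea can be sketched as a KV-cache eviction policy that keeps the initial sink tokens plus a sliding window of recent tokens. A minimal sketch with assumed hyperparameter names (`n_sink`, `window`; the paper uses 4 sink tokens by default):

```python
# Illustrative StreamingLLM-style eviction: keep the first `n_sink`
# KV entries (attention sinks) plus the most recent `window` entries,
# and evict everything in between.
def evict(cache_positions, n_sink=4, window=8):
    if len(cache_positions) <= n_sink + window:
        return cache_positions
    return cache_positions[:n_sink] + cache_positions[-window:]

positions = list(range(20))
kept = evict(positions)
print(kept)  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Keeping the sinks is the key difference from a plain sliding window: without positions 0..3, the attention distribution loses its anchor and perplexity degrades.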

🕳️ Attention Sinks in LLMs for endless fluency
A Blog post by Tom Aarsen on Hugging Face
https://huggingface.co/blog/tomaarsen/attention-sinks
Null attention
arxiv.org
https://arxiv.org/pdf/1906.04284
Prefix Usecase with quantization
Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, Jaeho Lee. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.
https://aclanthology.org/2024.emnlp-main.134/
The core methodology is to mathematically analyze the mechanism that generates massive activations in pre-norm Transformers and to run large-scale ablation experiments. When analyzing how massive activations arise in the SwiGLU feed-forward block, the SiLU activation SiLU(x) = x · σ(x) can be approximated as SiLU(x) ≈ x for large inputs, so the feed-forward transform reduces to a quadratic form: FFN(x) ≈ W_down((W_gate x) ⊙ (W_up x)). In this quadratic form, the eigenvalue spectrum of the associated matrix exhibits a rank-one–dominated structure, causing inputs along a particular direction to be amplified extremely.
Normalization then turns this spike token into a sparse, near-constant vector. Under RMSNorm, y = x / √(mean(xᵢ²)) ⊙ g, if one dimension dominates the overall norm, the remaining dimensions shrink to nearly zero, producing a sparse, near-constant representation. This constant-like vector acts as a stable reference point for query–key dot products, leading to the formation of an attention sink.
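This collapse can be checked numerically. A minimal sketch, with the learned gain g omitted for simplicity:

```python
import numpy as np

def rmsnorm(x):
    # RMSNorm without the learned gain, for illustration.
    return x / np.sqrt(np.mean(x ** 2))

# A "spike" hidden state: one massive activation dominates the norm.
d = 64
x = np.ones(d)
x[0] = 1000.0

y = rmsnorm(x)
# The spike dimension saturates near sqrt(d) = 8; the remaining
# dimensions collapse toward zero and are all identical, yielding
# the sparse, near-constant vector that anchors attention.
print(y[0], y[1])
```

Because the post-norm vector is nearly the same regardless of the token's content, its key produces stable dot products with every query, which is exactly the "stable reference point" behavior described above.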
The paper’s conditional gating experiment is a key result showing that the attention sink is a learned implicit gating mechanism. In the context-length ablation, enforcing a minimum context of at least 1024 tokens raises the sink ratio to 13.0%, while enforcing at least 2048 collapses it to 1.2%, confirming that the sink is fundamentally a mechanism for short-range prediction. A main limitation is that the experiments focus on a single 7B Llama-style architecture, so generalization to larger scales or other architectures (e.g., Mixture-of-Experts) is not validated.
arxiv.org
https://arxiv.org/pdf/2603.05498

Seonglae Cho