Anchor
Attention-score heatmaps show a strong vertical line at the System Prompt or BOS Token, which often receives more than 50% of the attention mass
LLMs aggregate information through Pyramidal Information Funneling: attention is scattered widely in the lower layers and progressively focuses on a few critical tokens in the higher layers.
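As a rough probe of this pattern, one can measure how much attention mass each layer assigns to the first token. The sketch below is a minimal, hypothetical example using Hugging Face transformers; the model name, prompt, and averaging choices are assumptions for illustration, and the sink effect is most pronounced in larger LLMs than the GPT-2 stand-in used here.

```python
# Minimal sketch: measure the share of attention each layer puts on the first token.
# GPT-2 is only a convenient stand-in; the sink effect is reported for larger LLMs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",  # eager attention so attention weights are returned
    output_attentions=True,
)
model.eval()

prompt = "Attention sinks concentrate attention on the first token."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, query_len, key_len)
for layer_idx, attn in enumerate(outputs.attentions):
    # Fraction of attention mass on key position 0, averaged over heads and
    # query positions (skipping query 0, which trivially attends only to itself).
    first_token_share = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer_idx:2d}: {first_token_share:.1%} of attention goes to the first token")
```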
SINK TOKEN
Even though keeping the first token might seem semantically meaningless, it is significant. Because of how the attention mechanism works, the first token acts as an anchor when computing attention scores: it is visible to every subsequent token, and since softmax forces the scores to sum to one, the model learns to dump excess attention onto it. So even if the token is semantically meaningless, the model structurally requires it.
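StreamingLLM builds on this observation: instead of keeping the full KV cache, it retains a handful of initial sink tokens plus a sliding window of the most recent tokens. The sketch below is a minimal illustration of that eviction policy under assumed defaults, not the paper's actual implementation; the function name, tensor layout, and parameter values are hypothetical.

```python
# Minimal sketch of a StreamingLLM-style KV-cache policy: always keep the first
# `num_sink_tokens` entries (attention sinks) plus a sliding window of the most
# recent `window_size` entries, evicting everything in between.
# Function name, layout, and defaults are illustrative assumptions.
from typing import Tuple
import torch


def evict_kv_cache(
    keys: torch.Tensor,     # (batch, heads, seq_len, head_dim)
    values: torch.Tensor,   # same shape as keys
    num_sink_tokens: int = 4,
    window_size: int = 1024,
) -> Tuple[torch.Tensor, torch.Tensor]:
    seq_len = keys.shape[2]
    if seq_len <= num_sink_tokens + window_size:
        return keys, values  # nothing to evict yet

    # Keep the sink tokens at the front and the most recent window at the back.
    sink_k, sink_v = keys[:, :, :num_sink_tokens], values[:, :, :num_sink_tokens]
    recent_k, recent_v = keys[:, :, -window_size:], values[:, :, -window_size:]
    return (
        torch.cat([sink_k, recent_k], dim=2),
        torch.cat([sink_v, recent_v], dim=2),
    )
```

In the paper, positional information is then assigned relative to slots inside the rolled cache rather than to positions in the original text, so relative distances stay within the range seen during training.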
Attention Sinks Notion
Attention sink
ICLR Poster Efficient Streaming Language Models with Attention Sinks
https://iclr.cc/virtual/2024/poster/18794
Efficient Streaming Language Models with Attention Sinks
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly,...
https://arxiv.org/abs/2309.17453

🕳️ Attention Sinks in LLMs for endless fluency
A Blog post by Tom Aarsen on Hugging Face
https://huggingface.co/blog/tomaarsen/attention-sinks
Null attention
Prefix use case with quantization
Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, Jaeho Lee. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.
https://aclanthology.org/2024.emnlp-main.134/

Seonglae Cho