Anchor
A strong vertical line in the attention-score map pointing to the system prompt or BOS token (often >50% of the attention mass)
LLMs aggregate information through Pyramidal Information Funneling, where attention scatters widely in lower layers and ultimately focuses on critical tokens in higher layers.
SINK TOKEN
Even though keeping the first token might seem semantically meaningless, it is significant. Due to the characteristics of the attention mechanism, the first token serves as an anchor for attention-score calculation through positional embedding. Therefore, even if it is semantically meaningless, the model structurally requires it.
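A toy softmax illustrates why such an anchor emerges: attention weights must sum to 1, so when a query matches no content token well, a sink token can absorb the leftover mass. All numbers below are illustrative, not from a real model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy logits: the query matches none of the "content" keys well.
content_logits = np.array([0.1, -0.2, 0.0, 0.1])

# Without a sink, softmax must still distribute ~100% of attention
# across irrelevant tokens (each attention row always sums to 1).
no_sink = softmax(content_logits)

# With a sink token (index 0) whose key yields a higher logit,
# most of the unneeded attention mass collects on the sink.
with_sink = softmax(np.concatenate(([3.0], content_logits)))

print(no_sink.round(3))
print(with_sink.round(3), "sink mass:", with_sink[0].round(3))
```

With these illustrative logits the sink token ends up with well over half of the attention mass, matching the ">50%" pattern observed on BOS tokens.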
Attention Sinks Notion
Attention sink
ICLR Poster Efficient Streaming Language Models with Attention Sinks
https://iclr.cc/virtual/2024/poster/18794
Efficient Streaming Language Models with Attention Sinks
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly,...
https://arxiv.org/abs/2309.17453
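StreamingLLM's core idea can be sketched as a KV-cache eviction policy that keeps the initial sink tokens plus a sliding window of recent tokens. A minimal sketch with assumed hyperparameter names (`n_sink`, `window`; the paper uses 4 sink tokens by default):

```python
# Illustrative StreamingLLM-style eviction: keep the first `n_sink`
# KV entries (attention sinks) plus the most recent `window` entries,
# and evict everything in between.
def evict(cache_positions, n_sink=4, window=8):
    if len(cache_positions) <= n_sink + window:
        return cache_positions
    return cache_positions[:n_sink] + cache_positions[-window:]

positions = list(range(20))
kept = evict(positions)
print(kept)  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Keeping the sinks is the key difference from a plain sliding window: without positions 0..3, the attention distribution loses its anchor and perplexity degrades.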

🕳️ Attention Sinks in LLMs for endless fluency
A Blog post by Tom Aarsen on Hugging Face
https://huggingface.co/blog/tomaarsen/attention-sinks
Null attention
arxiv.org
https://arxiv.org/pdf/1906.04284
Prefix Usecase with quantization
Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, Jaeho Lee. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024.
https://aclanthology.org/2024.emnlp-main.134/
The core methodology is to mathematically analyze the mechanism that generates massive activations in pre-norm Transformers and to run large-scale ablation experiments. When analyzing how massive activations arise in the SwiGLU feed-forward block, the SiLU activation SiLU(x) = x · σ(x) can be approximated as SiLU(x) ≈ x for large inputs, so the feed-forward transform reduces to a quadratic form: FFN(x) ≈ W_down((W_gate x) ⊙ (W_up x)). In this quadratic form, the eigenvalue spectrum of the associated matrix exhibits a rank-one–dominated structure, causing inputs along a particular direction to be amplified extremely.
Normalization then turns this spike token into a sparse, near-constant vector. Under RMSNorm, y = x / √(mean(xᵢ²)) ⊙ g, if one dimension dominates the overall norm, the remaining dimensions shrink to nearly zero, producing a sparse, near-constant representation. This constant-like vector acts as a stable reference point for query–key dot products, leading to the formation of an attention sink.
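This collapse can be checked numerically. A minimal sketch, with the learned gain g omitted for simplicity:

```python
import numpy as np

def rmsnorm(x):
    # RMSNorm without the learned gain, for illustration.
    return x / np.sqrt(np.mean(x ** 2))

# A "spike" hidden state: one massive activation dominates the norm.
d = 64
x = np.ones(d)
x[0] = 1000.0

y = rmsnorm(x)
# The spike dimension saturates near sqrt(d) = 8; the remaining
# dimensions collapse toward zero and are all identical, yielding
# the sparse, near-constant vector that anchors attention.
print(y[0], y[1])
```

Because the post-norm vector is nearly the same regardless of the token's content, its key produces stable dot products with every query, which is exactly the "stable reference point" behavior described above.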
The paper’s conditional gating experiment is a key result showing that the attention sink is a learned implicit gating mechanism. In the context-length ablation, enforcing a minimum context of at least 1024 tokens raises the sink ratio to 13.0%, while enforcing at least 2048 collapses it to 1.2%, confirming that the sink is fundamentally a mechanism for short-range prediction. A main limitation is that the experiments focus on a single 7B Llama-style architecture, so generalization to larger scales or other architectures (e.g., Mixture-of-Experts) is not validated.
arxiv.org
https://arxiv.org/pdf/2603.05498

Seonglae Cho