Attention Sink

Creator: Seonglae Cho
Created: 2023 Oct 14 12:48
Edited: 2024 Nov 27 15:4

A strong vertical line in the Attention Score map pointing to the System Prompt or BOS Token, which often receives the majority (>50%) of the attention mass.
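
A minimal sketch of how to observe this: load a small Hugging Face causal LM with attention outputs enabled and measure how much mass queries assign to the first token. GPT-2 and the prompt are illustrative choices, not from the original note.

```python
# Sketch: measure attention mass on the first (sink) token, per layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# eager attention so the model actually returns attention weights
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: tuple of (batch, heads, query, key) tensors, one per layer
for layer, attn in enumerate(out.attentions):
    sink_mass = attn[0, :, 1:, 0].mean().item()  # mass later queries put on token 0
    print(f"layer {layer:2d}: mean attention on first token = {sink_mass:.2f}")
```

In deeper layers this fraction typically climbs well past 0.5, which is exactly the vertical line seen in attention heatmaps.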

LLMs aggregate information through Pyramidal Information Funneling: attention is scattered widely in the lower layers and ultimately focuses on critical tokens in the upper layers.
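
One way to quantify this funneling, sketched under the same setup as the snippet above: compute the entropy of each attention row per layer. Scattered attention means high entropy; focused attention means low entropy.

```python
# Sketch: per-layer attention entropy as a proxy for scattered vs. focused
# attention. `attentions` is the tuple a Hugging Face model returns when
# called with output_attentions=True (see the snippet above).
def attention_entropy_by_layer(attentions):
    for layer, attn in enumerate(attentions):
        probs = attn[0]                                # (heads, query, key)
        ent = -(probs * (probs + 1e-9).log()).sum(-1)  # entropy of each query row
        print(f"layer {layer:2d}: mean attention entropy = {ent.mean().item():.2f}")

# attention_entropy_by_layer(out.attentions)  # `out` from the snippet above
```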

Sink Token

Although keeping the first token may seem semantically meaningless, it is significant. Because softmax in the Attention Mechanism must distribute a full unit of probability mass at every step, a query with no strongly relevant key still has to put that mass somewhere, and the first token, visible to every subsequent position through positional embedding, serves as the anchor for the Attention Score. So even if it is semantically meaningless, the model structurally requires it.
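
A toy illustration of why, in plain NumPy with illustrative numbers: softmax always hands out exactly 1.0 of attention, so when no key is relevant the mass must still land somewhere, and a learned bias toward the always-visible first token gives it a home.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([0.0, 0.1, -0.2, 0.05])  # no key stands out
print(softmax(scores))                     # mass spread almost uniformly

scores[0] = 4.0                            # learned bias toward token 0
print(softmax(scores))                     # ~95% of the mass lands on the sink
```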
Attention Sinks Notion
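
A practical consequence: streaming and KV-cache eviction schemes such as StreamingLLM always retain the first few (sink) tokens plus a recent window, because evicting the sink degrades generation. A minimal sketch of that retention policy follows; the function name and defaults are illustrative, not an official API.

```python
def keep_indices(seq_len: int, num_sink: int = 4, window: int = 8) -> list[int]:
    """Return the KV-cache positions to keep for the next decoding step."""
    if seq_len <= num_sink + window:
        return list(range(seq_len))  # short context: keep everything
    return list(range(num_sink)) + list(range(seq_len - window, seq_len))

print(keep_indices(6))   # [0, 1, 2, 3, 4, 5]
print(keep_indices(20))  # sinks [0..3] + recent window [12..19]
```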

Attention sink

Null attention

Massive Activation

Sink tokens are typically special tokens, delimiters, conjunctions, prepositions, the first token, or number tokens: tokens with weak semantics.

When it emerges

 
 

Recommendations