In attention-score heatmaps, a strong vertical line appears at the system prompt or the BOS token, which often receives more than 50% of the attention mass.
LLMs aggregate information through Pyramidal Information Funneling: attention is scattered broadly across tokens in the lower layers and progressively concentrates on a few critical tokens in the upper layers.
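A quick way to see this line, assuming a Hugging Face causal LM (the choice of gpt2 and the sample sentence are arbitrary): average each layer's attention over heads and query positions and report the share landing on the first token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one (batch, heads, query, key) tensor per layer.
for i, attn in enumerate(out.attentions):
    # Average over batch, heads, and query positions; key column 0 is the first token.
    sink_share = attn.mean(dim=(0, 1))[:, 0].mean().item()
    print(f"layer {i:2d}: {sink_share:.1%} of attention on the first token")
```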
SINK TOKEN
Keeping the first token might seem semantically meaningless, but it is significant. Because softmax forces each query's attention scores to sum to 1, and under causal masking the first position is visible to every later query, the first token serves as an anchor where surplus attention mass is deposited; it is the position, not the content, that matters. Therefore, even if it is semantically meaningless, the model structurally requires it.
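A minimal sketch of exploiting this structure, in the spirit of StreamingLLM's cache policy: always retain a few leading sink tokens in the KV cache alongside a sliding window of recent tokens. The function name, defaults, and tensor layout here are illustrative assumptions.

```python
import torch

def evict_kv(keys, values, n_sink=4, window=1024):
    """keys, values: (batch, heads, seq, head_dim).
    Keep the first n_sink anchor tokens plus the most recent `window` tokens."""
    seq = keys.size(2)
    if seq <= n_sink + window:
        return keys, values                  # nothing to evict yet
    keep = torch.cat([
        torch.arange(n_sink),                # sink tokens at the front
        torch.arange(seq - window, seq),     # sliding window of recent tokens
    ])
    return keys[:, :, keep], values[:, :, keep]
```

Evicting middle tokens while keeping the sinks preserves generation quality far better than a plain sliding window, which is exactly the structural-anchor point above.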
Notions related to attention sinks
Attention sink: a token (often the first) that receives a disproportionate share of attention regardless of its content
Null attention: heads dumping their attention mass on a fixed position as a no-op when no key is actually relevant
Massive activation: a handful of hidden-state values with extremely large magnitudes, concentrated on a few tokens
Tokens that commonly become sinks: special tokens, delimiters, conjunctions, prepositions, the first token, and number tokens, i.e., tokens with weak semantics.
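One rough way to surface such tokens empirically is to flag keys whose average received attention is outsized; the 0.3 threshold and the causal-visibility normalization below are assumptions, not from these notes. A single layer from the earlier probe (e.g. out.attentions[0][0]) can be fed straight in.

```python
import torch

def find_sink_tokens(attn, threshold=0.3):
    """attn: (heads, seq, seq) causal attention from one layer.
    Returns indices of keys whose average received attention exceeds threshold."""
    seq = attn.size(-1)
    incoming = attn.mean(dim=0).sum(dim=0)         # total mass landing on each key
    visibility = torch.arange(seq, 0, -1).float()  # key j is seen by seq - j queries
    share = incoming / visibility                  # average share per visible query
    return (share > threshold).nonzero(as_tuple=True)[0].tolist()
```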