Attention Weight

Creator
Seonglae Cho
Created
2023 Aug 21 16:39
Edited
2026 Mar 8 15:32
Refs

Attention Distribution, Attention Pattern

The attention pattern is a function of both the source and destination tokens. The value of the attended token is ignored when computing the attention pattern; it only determines what information is moved once the pattern is set.

Set of softmax-normalized
Attention Score

The set of attention scores (the attention matrix)
Normalized so that each row of attention weights sums to 1
Represents a probability distribution indicating how much the model should focus on each part of the input data
Scores are converted to probabilities using softmax
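The softmax normalization above can be sketched with NumPy (toy scores, not from any real model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy attention scores: 3 query positions over 3 key positions
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.5],
                   [1.0, 1.0, 1.0]])

weights = softmax(scores)      # attention distribution
print(weights.sum(axis=-1))    # each row sums to 1
```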

Attention weight

The per-token row of the attention distribution, representing the actual weight applied to each part of the input data
Multiplied with the values to produce a weighted sum (
Attention Output
)
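Putting the pieces together, a minimal single-head scaled dot-product attention sketch (random toy matrices, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

d_k = 4
Q = rng.standard_normal((3, d_k))  # queries
K = rng.standard_normal((3, d_k))  # keys
V = rng.standard_normal((3, d_k))  # values

scores = Q @ K.T / np.sqrt(d_k)    # attention scores
weights = softmax(scores)          # attention weights: each row sums to 1
output = weights @ V               # weighted sum of values = attention output
```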

THE FREEZING ATTENTION PATTERNS TRICK

Thinking of the OV and QK circuits separately can be very useful, since they're both individually functions we can understand (linear or bilinear functions operating on matrices we understand).
But is it really principled to think about them independently? One thought experiment which might be helpful is to imagine running the model twice. The first time you collect the attention patterns of each head. This only depends on the QK circuit. The second time, you replace the attention patterns with the "frozen" attention patterns you collected the first time. This gives you a function where the logits are a linear function of the tokens! We find this a very powerful way to think about transformers.
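The thought experiment can be illustrated with a toy one-head layer (hypothetical `W_qk` and `W_ov` matrices standing in for the QK and OV circuits): once the pattern from the first pass is frozen, the second pass is linear in its input.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Toy one-head attention. W_qk plays the QK circuit, W_ov the OV circuit.
d = 4
W_qk = rng.standard_normal((d, d))
W_ov = rng.standard_normal((d, d))

def attn(x, pattern=None):
    # Pass 1: compute the pattern from the QK circuit (the only nonlinearity)
    if pattern is None:
        pattern = softmax(x @ W_qk @ x.T)
    # Pass 2: with a frozen pattern, the output is linear in x
    return pattern @ x @ W_ov, pattern

x = rng.standard_normal((5, d))
_, frozen = attn(x)                   # first run: collect the pattern

# Linearity check: with the pattern frozen, doubling x doubles the output
y1, _ = attn(2.0 * x, pattern=frozen)
y2, _ = attn(x, pattern=frozen)
print(np.allclose(y1, 2.0 * y2))      # True
```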

Attention
Motif
attention-motifs
mivanit · Updated 2026 Mar 6 16:39

The idea is to classify attention heads not by which tokens they attend to, but by the shape of the attention pattern itself.
  1. Extract numerical features from visual motifs in the attention matrix, such as strong diagonals, first-token sink, vertical bars, and recent-token bands.
  2. Use PCA to embed these features and define distances between attention patterns.
  3. Average patterns across multiple prompts to create head-to-head distances.
  4. Embed the heads using these distances, which shows that known classes like induction heads, name movers, and previous-token heads cluster well together.
The claim is that head functionality similarity can be captured by pattern shape alone, without token semantics or residual features. This computation is very cheap and makes it easy to find similar heads across different models.
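A rough sketch of the feature-extraction step under simplifying assumptions (hand-built toy patterns and a minimal feature set; the full motif list and the PCA embedding from the paper are omitted):

```python
import numpy as np

def motif_features(A):
    """Shape-based features of an attention pattern A (rows sum to 1)."""
    n = A.shape[0]
    idx = np.arange(n)
    return np.array([
        A[idx, idx].mean(),               # self-attention strength (main diagonal)
        A[idx[1:], idx[1:] - 1].mean(),   # previous-token band (first sub-diagonal)
        A[:, 0].mean(),                   # first-token sink (attention to position 0)
    ])

# Hypothetical patterns: a previous-token head vs. a first-token-sink head
n = 6
prev = np.eye(n, k=-1); prev[0, 0] = 1.0   # each token attends to the previous one
sink = np.zeros((n, n)); sink[:, 0] = 1.0  # every token attends to the first token

f_prev, f_sink = motif_features(prev), motif_features(sink)
print(f_prev, f_sink)  # the two heads separate on these shape features alone
```

Distances between such feature vectors (or their PCA projections) can then be averaged over prompts to compare heads across models, as the note describes.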
openreview.net
Embedding Explorer
