Attention Distribution, Attention Pattern
The attention pattern is a function of both the source and destination tokens. The content carried by the attended token (its value vector) plays no role in computing the pattern itself; that information only flows through the OV circuit.
- The set of softmax-normalized attention scores, i.e. the attention matrix.
- Each row is normalized so that its attention scores sum to 1: raw scores are converted to probabilities with softmax.
- Represents a probability distribution indicating how much the model should focus on each part of the input.
Attention weight
- The per-token row entry in the attention distribution, representing the actual weight applied to each part of the input.
- Multiplied with the value vectors to produce a weighted sum (the attention output).
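The definitions above can be condensed into a minimal single-head sketch (NumPy, illustrative only): raw scores, row-wise softmax into a distribution, then the weighted sum over values.

```python
import numpy as np

def attention(Q, K, V):
    """Scores -> softmax rows (attention distribution) -> weighted sum of values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # raw attention scores
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    pattern = np.exp(scores)
    pattern /= pattern.sum(axis=-1, keepdims=True)    # each row now sums to 1
    return pattern @ V, pattern                       # attention output, attention matrix

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out, pattern = attention(Q, K, V)
assert np.allclose(pattern.sum(axis=-1), 1.0)        # rows are probability distributions
```

Each row of `pattern` is one token's attention distribution; the corresponding row of `out` is that token's weighted sum of value vectors.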
THE FREEZING ATTENTION PATTERNS TRICK
Thinking of the OV and QK circuits separately can be very useful, since they're both individually functions we can understand (linear or bilinear functions operating on matrices we understand).
But is it really principled to think about them independently? One helpful thought experiment is to imagine running the model twice. The first time, you collect the attention pattern of each head; this depends only on the QK circuit. The second time, you replace each head's attention pattern with the "frozen" pattern collected on the first run. This gives you a function in which the logits are a linear function of the tokens! We find this a very powerful way to think about transformers.
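The two-pass experiment can be sketched for a single head (NumPy; layer norm, biases, and the unembedding are omitted, and all weights are random placeholders). Once the pattern A is frozen, the head computes A x W_V W_O, which is linear in x:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5
W_Q, W_K, W_V, W_O = (rng.standard_normal((d, d)) for _ in range(4))

def pattern(x):
    """QK circuit: attention pattern as a function of the residual stream x."""
    s = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    p = np.exp(s)
    return p / p.sum(axis=-1, keepdims=True)

def head_frozen(x, A):
    """OV circuit with a frozen pattern A: linear in x."""
    return A @ x @ W_V @ W_O

x = rng.standard_normal((n, d))
A = pattern(x)                      # first pass: collect the pattern

# Linearity check: with A held fixed, the head respects superposition.
y1, y2 = rng.standard_normal((n, d)), rng.standard_normal((n, d))
lhs = head_frozen(2 * y1 + 3 * y2, A)
rhs = 2 * head_frozen(y1, A) + 3 * head_frozen(y2, A)
assert np.allclose(lhs, rhs)
```

The full model's attention pattern is of course *not* independent of the input; freezing it is precisely what isolates the linear OV computation.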
Attention Motif
attention-motifs (mivanit • Updated 2026 Mar 6 16:39)
The idea is to classify attention heads not by which tokens they attend to, but by the shape of the attention pattern itself.
- Extract numerical features from visual motifs in the attention matrix, such as strong diagonals, first token sink, vertical bars, and recent-token bands.
- Use PCA to embed these features and define distances between attention patterns.
- Average patterns across multiple prompts to create head-to-head distances.
- Embed the heads using these distances, which shows that known classes like induction heads, name movers, and previous token heads cluster well together.
The claim is that head functionality similarity can be captured by pattern shape alone, without token semantics or residual features. This computation is very cheap and makes it easy to find similar heads across different models.
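The first step above can be sketched as follows. These feature names and formulas are illustrative guesses at the kind of motif statistics described, not the paper's exact features:

```python
import numpy as np

def motif_features(A):
    """Hypothetical shape features for one attention pattern A (rows sum to 1)."""
    n = A.shape[0]
    i = np.arange(n)
    prev_diag = A[i[1:], i[:-1]].mean()   # previous-token band (sub-diagonal)
    first_sink = A[:, 0].mean()           # first-token "sink" mass
    self_diag = A[i, i].mean()            # main diagonal (self-attention)
    vert_bar = A.max(axis=0).mean()       # vertical-bar tendency
    return np.array([prev_diag, first_sink, self_diag, vert_bar])

# A perfect previous-token head: each row attends to the token before it.
n = 6
A = np.zeros((n, n))
A[0, 0] = 1.0
A[np.arange(1, n), np.arange(n - 1)] = 1.0
f = motif_features(A)
assert f[0] == 1.0   # sub-diagonal feature saturates for a previous-token head
```

Feature vectors like these, averaged per head over many prompts, could then be embedded with PCA (e.g. `sklearn.decomposition.PCA`) and compared with Euclidean distance, matching the pipeline sketched in the bullets.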
https://openreview.net/pdf?id=ND2WsCwlDQ
Embedding Explorer
https://attention-motifs.github.io/v1/vis/embeds/patterns/index.html?defaultColorColumn=activation.cls&selectedValues=pythia-1b%2Cgpt2-medium%2CLlama-3-2-1B%2Cgemma-2-2b&camera.position.x=1044.4373571104884&camera.position.y=-1502.106420998174&camera.position.z=51.849397844306694&camera.rotation.pitch=0.6367963267948957&camera.rotation.yaw=-20.49999999999998&camera.rotation.roll=7.7999999999999226

Seonglae Cho