Attention Distribution

Creator
Alan Jo
Created
2023 Aug 21 16:39
Editor
Alan Jo
Edited
2024 Jun 30 4:31
Refs
Attention Pattern, Attention Weight

The attention pattern is a function of both the source and destination tokens; the value of the attended-to token is ignored when the attention pattern is computed (the pattern comes from the QK circuit, not the OV circuit).

Set of softmax-normalized Attention Scores

The attention distribution is the set of attention scores (the attention matrix) passed through softmax, so that the scores in each row sum to 1.
It is a probability distribution describing which parts of the input the model should attend to, and how strongly.
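As a minimal sketch (assuming standard scaled dot-product attention; the function and variable names here are illustrative, not from any particular library), the scores-to-distribution step looks like:

```python
import numpy as np

def attention_distribution(Q, K):
    """Softmax-normalize raw attention scores into a probability distribution.

    Q: (seq_len, d_k) queries, K: (seq_len, d_k) keys.
    Each row of the result sums to 1.
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # raw attention scores
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)   # softmax over each row
```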
 

Attention weight

The per-token row of the attention distribution: the weights actually applied to each part of the input.
It is multiplied with the values to produce the weighted sum (the Attention Value).
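Continuing the same illustrative sketch, applying the weights to the values:

```python
def attention_value(weights, V):
    """Each output row is a weighted sum of the rows of V,
    using that token's attention weights."""
    return weights @ V  # (seq_len, seq_len) @ (seq_len, d_v) -> (seq_len, d_v)

# Usage, with the sketch above:
#   weights = attention_distribution(Q, K)  # rows sum to 1
#   out = attention_value(weights, V)       # the Attention Value
```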

THE FREEZING ATTENTION PATTERNS TRICK

Thinking of the OV and QK circuits separately can be very useful, since they're both individually functions we can understand (linear or bilinear functions operating on matrices we understand).
But is it really principled to think about them independently? One thought experiment which might be helpful is to imagine running the model twice. The first time, you collect the attention patterns of each head; these depend only on the QK circuit. The second time, you replace the attention patterns with the "frozen" patterns collected on the first run. This gives you a function where the logits are a linear function of the tokens. We find this a very powerful way to think about transformers.
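A toy numpy sketch of this thought experiment (one hypothetical attention head with random weights; every name here is illustrative): once the pattern is frozen, the head's output is a linear function of the token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_head = 4, 8, 2

# Illustrative random weights for one attention head.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(x, frozen_pattern=None):
    """One attention head; if frozen_pattern is given, the QK circuit is skipped."""
    if frozen_pattern is None:
        pattern = softmax((x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head))  # QK circuit
    else:
        pattern = frozen_pattern
    return pattern @ (x @ W_V) @ W_O                                  # OV circuit

x = rng.normal(size=(seq, d_model))

# First run: collect the attention pattern (depends only on the QK circuit).
pattern = softmax((x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head))

# Second run: with the pattern frozen, the head is linear in x.
assert np.allclose(head(2 * x, pattern), 2 * head(x, pattern))
```

The assert passes because, with the pattern held constant, the only remaining path from input to output is the linear OV circuit.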

Recommendations