Attention Distribution, Attention Pattern
The attention pattern is a function of both the source and destination token. The value vectors (the OV circuit) play no role in computing the attention pattern; only the QK circuit does.
The set of softmax-normalized attention scores (the attention matrix).
Scores are normalized with softmax so that they sum to 1.
A probability distribution over how much attention the model should give to each part of the input.
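A minimal NumPy sketch of this normalization; the scores matrix below is a made-up toy example, not taken from any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy attention scores: 3 destination tokens attending to 3 source tokens.
scores = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 1.2,  0.3],
                   [-0.5, 0.0,  2.5]])

# Attention distribution / pattern: each row is softmax-normalized,
# so the weights over source tokens sum to 1 for every destination token.
attn_pattern = softmax(scores, axis=-1)
print(attn_pattern.sum(axis=-1))  # -> [1. 1. 1.]
```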
Attention weight
A per-token row of the attention distribution: the weights actually applied to each part of the input.
Multiplied by the values to produce a weighted sum (the attention output).
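A small sketch of that weighted sum, again with toy numbers; `attn_weights` and `values` are illustrative stand-ins for a real head's attention weights and per-token value vectors:

```python
import numpy as np

# One row of attention weights per destination token (3 destinations, 3 sources),
# e.g. a softmax-normalized attention pattern like the one above.
attn_weights = np.array([[0.7, 0.2, 0.1],
                         [0.2, 0.6, 0.2],
                         [0.1, 0.1, 0.8]])

# Hypothetical value vectors: one 4-dimensional vector per source token.
values = np.arange(12.0).reshape(3, 4)

# Each output row is the attention-weighted sum of the value vectors:
# attention_output[i] = sum_j attn_weights[i, j] * values[j]
attention_output = attn_weights @ values   # shape (3, 4)
```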
THE FREEZING ATTENTION PATTERNS TRICK
Thinking of the OV and QK circuits separately can be very useful, since they're both individually functions we can understand (linear or bilinear functions operating on matrices we understand).
But is it really principled to think about them independently? One thought experiment which might be helpful is to imagine running the model twice. The first time, you collect the attention patterns of each head; these depend only on the QK circuit. The second time, you replace the attention patterns with the "frozen" attention patterns you collected the first time. This gives you a function where the logits are a linear function of the tokens! We find this a very powerful way to think about transformers.
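A toy NumPy sketch of the two-pass idea for a single-head, one-layer, attention-only model; all weight matrices here are random stand-ins, and scaling and causal masking are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, n_ctx = 10, 8, 5

W_E = rng.normal(size=(n_vocab, d_model))   # embedding
W_Q = rng.normal(size=(d_model, d_model))   # query
W_K = rng.normal(size=(d_model, d_model))   # key
W_V = rng.normal(size=(d_model, d_model))   # value
W_O = rng.normal(size=(d_model, d_model))   # output
W_U = rng.normal(size=(d_model, n_vocab))   # unembedding

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

tokens = rng.integers(0, n_vocab, size=n_ctx)
x = W_E[tokens]                              # (n_ctx, d_model)

# Run 1: collect the attention pattern. It depends only on the QK circuit.
pattern = softmax((x @ W_Q) @ (x @ W_K).T)

# Run 2: with the pattern frozen (treated as a constant), the logits are a
# linear function of the token embeddings: attention path + direct path.
logits = (pattern @ x) @ W_V @ W_O @ W_U + x @ W_U
```

With `pattern` held fixed, every step from `x` to `logits` is a matrix product, which is what makes the OV-circuit analysis linear.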