Attention Distribution

Creator
Alan Jo
Created
2023 Aug 21 16:39
Editor
Alan Jo
Edited
2024 Jun 30 4:31
Refs
Attention Pattern, Attention Weight

The attention pattern is a function of both the source and destination tokens; the value of the attended-to token is ignored when the attention pattern is computed (the pattern comes from the QK circuit, not the OV circuit).

Set of softmax-normalized Attention Scores

The attention distribution is the set of attention scores (the attention matrix) passed through softmax, so that the scores in each row sum to 1.
It is a probability distribution describing which parts of the input the model should attend to, and how strongly.
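As a minimal sketch (assuming standard scaled dot-product attention; the function and variable names here are illustrative, not from any particular library), the scores-to-distribution step looks like:

```python
import numpy as np

def attention_distribution(Q, K):
    """Softmax-normalize raw attention scores into a probability distribution.

    Q: (seq_len, d_k) queries, K: (seq_len, d_k) keys.
    Each row of the result sums to 1.
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # raw attention scores
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)   # softmax over each row
```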
 

Attention weight

The per-token row of the attention distribution: the weights actually applied to each part of the input.
It is multiplied with the values to produce the weighted sum (the Attention Value).
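Continuing the same illustrative sketch, applying the weights to the values:

```python
def attention_value(weights, V):
    """Each output row is a weighted sum of the rows of V,
    using that token's attention weights."""
    return weights @ V  # (seq_len, seq_len) @ (seq_len, d_v) -> (seq_len, d_v)

# Usage, with the sketch above:
#   weights = attention_distribution(Q, K)  # rows sum to 1
#   out = attention_value(weights, V)       # the Attention Value
```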

THE FREEZING ATTENTION PATTERNS TRICK

Thinking of the OV and QK circuits separately can be very useful, since they're both individually functions we can understand (linear or bilinear functions operating on matrices we understand).
But is it really principled to think about them independently? One thought experiment which might be helpful is to imagine running the model twice. The first time, you collect the attention patterns of each head; these depend only on the QK circuit. The second time, you replace the attention patterns with the "frozen" patterns collected on the first run. This gives you a function where the logits are a linear function of the tokens. We find this a very powerful way to think about transformers.
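A toy numpy sketch of this thought experiment (one hypothetical attention head with random weights; every name here is illustrative): once the pattern is frozen, the head's output is a linear function of the token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_head = 4, 8, 2

# Illustrative random weights for one attention head.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(x, frozen_pattern=None):
    """One attention head; if frozen_pattern is given, the QK circuit is skipped."""
    if frozen_pattern is None:
        pattern = softmax((x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head))  # QK circuit
    else:
        pattern = frozen_pattern
    return pattern @ (x @ W_V) @ W_O                                  # OV circuit

x = rng.normal(size=(seq, d_model))

# First run: collect the attention pattern (depends only on the QK circuit).
pattern = softmax((x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head))

# Second run: with the pattern frozen, the head is linear in x.
assert np.allclose(head(2 * x, pattern), 2 * head(x, pattern))
```

The assert passes because, with the pattern held constant, the only remaining path from input to output is the linear OV circuit.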

Recommendations