Attention Mechanism

Creator
Alan Jo
Created
2023 Mar 5 23:49
Editor
Alan Jo
Edited
2024 May 27 9:37

Inter-Token Communication mechanism

It starts from the assumption that the latent vector just before a word is output will be similar to the vector just after that word is input.
The technique emerged to compensate for the drop in output-sequence accuracy as the input sequence grows longer; by itself it does not take order information into account.
Attention is how much weight the query word should give each word in the sentence. This is computed via a dot product between the query vector and all the key vectors. These dot products then go through a softmax which makes the attention scores (across all keys) sum to 1.
  1. Q - What am I looking for
  2. K - What do I contain
  3. V - What I communicate to another token
  4. Attention Score
    (via dot products of Q and K)
  5. Attention Distribution
    (attention weights, via softmax)
  6. Attention Value
    (attention weights applied to V)
  7. Attention output is softmax(QKᵀ / sqrt(d_k))V
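The steps above can be sketched as a minimal single-head scaled dot-product attention in NumPy (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (T, d_k) - what each token is looking for
    # K: (T, d_k) - what each token contains
    # V: (T, d_v) - what each token communicates
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # attention scores
    weights = softmax(scores, axis=-1)   # attention distribution (rows sum to 1)
    return weights @ V                   # attention values

rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = rng.normal(size=(3, T, d))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a weighted sum of the value vectors, with weights given by how well that token's query matches every key.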
Attention-Mechanism Notion
Attention Mechanism usages

Andrej Karpathy noted that

  • Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
  • There is no notion of space (position). Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
  • Examples across the batch dimension are of course processed completely independently and never "talk" to each other
  • In an "encoder" attention block just delete the single line that does masking with tril, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
  • "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
  • "Scaled" attention additionally scales wei by 1/sqrt(head_size). This makes it so that when inputs Q, K are unit variance, wei will be unit variance too, and Softmax will stay diffuse and not saturate too much.
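The effect of scaling in the last point can be checked numerically: with unit-variance Q and K, the raw dot products have variance close to head_size, so softmax over them concentrates on a few keys; scaling by 1/sqrt(head_size) brings the variance back to about 1 and keeps the distribution diffuse (a small sketch, not Karpathy's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
head_size = 64
q = rng.normal(size=(1000, head_size))  # unit-variance queries
k = rng.normal(size=(1000, head_size))  # unit-variance keys

wei_raw = q @ k.T                          # variance ~ head_size (~64)
wei_scaled = wei_raw / np.sqrt(head_size)  # variance ~ 1

print(wei_raw.var(), wei_scaled.var())

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# average of the largest softmax weight per row:
# large for the unscaled version (saturated, near one-hot),
# small for the scaled version (diffuse)
print(softmax(wei_raw).max(axis=-1).mean())
print(softmax(wei_scaled).max(axis=-1).mean())
```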

Paper (2014) · ICLR 2015 (30,000+ citations)

NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio)

Visualization

Process

