Attention Mechanism

Inter-Token Communication mechanism

Attention starts from the assumption that the latent vector just before a word is output is similar to the latent vector right after the corresponding word is input.

The technique was introduced to counter the drop in output accuracy as input sequences grow longer. By itself it does not take order information into account.

Attention is how much weight the query word should give each word in the sentence. It is computed as the dot product between the query vector and each key vector. These dot products then pass through a softmax, which makes the attention scores (across all keys) sum to 1.
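The computation for a single query can be sketched as follows. This is a minimal illustration, assuming NumPy is available; the vectors are arbitrary toy values and the `softmax` helper is written here for the example, not taken from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# Toy data (hypothetical): one query vector and four key vectors of dimension 3.
query = np.array([1.0, 0.0, 1.0])
keys = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

scores = keys @ query        # one dot product per key, shape (4,)
weights = softmax(scores)    # non-negative weights over the keys that sum to 1

print(scores)
print(weights, weights.sum())
```

The key that is most aligned with the query (the first one here) receives the largest weight.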

Q - What am I looking for

K - What do I contain

V - What I communicate to another token

Attention Score (computed from Q and K; the scores form the attention matrix)

Attention Weight (softmax of the scores → attention distribution)

Attention Output

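Attention output is softmax(QK^T)V. The scaled variant first divides the scores by sqrt(d_k), where d_k is the key dimension: softmax(QK^T / sqrt(d_k))V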
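The three stages (score, weight, output) can be sketched in a few lines. This is an illustrative version of the scaled dot-product form softmax(QK^T / sqrt(d_k))V, assuming NumPy and random toy inputs; the function name `attention` and the shapes are choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for 2-D arrays (hypothetical helper)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # attention scores, shape (T_q, T_k)
    weights = softmax(scores, axis=-1)   # attention weights, each row sums to 1
    output = weights @ V                 # attention output, shape (T_q, d_v)
    return output, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries of dimension 8
K = rng.normal(size=(6, 8))   # 6 keys of dimension 8
V = rng.normal(size=(6, 5))   # 6 values of dimension 5

out, w = attention(Q, K, V)
print(out.shape, w.shape)
```

Each output row is a weighted average of the value rows, with the weights given by the corresponding row of the attention matrix.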

The attention mechanism inherently has an inductive bias toward sparse activation and the Superposition Hypothesis, since it must focus on different tokens depending on the context.

Attention-Mechanism Notion

Attention Mechanism Type

Local/Global Attention

Soft/Hard Attention

Scaled Attention

Attention Mechanism usages

Attention Mechanism Optimization

Attention Mechanism abstraction

Reversing Transformer

Screening Mechanism

Andrej Karpathy noted that

Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.

There is no notion of space (position). Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
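One way to see the lack of positional information is to shuffle the input tokens: without positional encoding, the self-attention outputs are shuffled in exactly the same way and are otherwise unchanged. The sketch below assumes NumPy; the projection matrices `W_q`, `W_k`, `W_v` are random placeholders, not trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Unmasked single-head self-attention over a set of token vectors.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
T, d_model, head_size = 6, 16, 8
W_q, W_k, W_v = (rng.normal(size=(d_model, head_size)) for _ in range(3))
x = rng.normal(size=(T, d_model))   # token vectors with no positional encoding

perm = rng.permutation(T)           # an arbitrary reordering of the tokens
out = self_attention(x, W_q, W_k, W_v)
out_shuffled = self_attention(x[perm], W_q, W_k, W_v)

# Shuffling the inputs only shuffles the outputs; nothing else changes.
print(np.allclose(out_shuffled, out[perm]))
```

Adding a position-dependent vector to each token before attention breaks this symmetry, which is what positional encoding is for.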

Examples across the batch dimension are of course processed completely independently and never "talk" to each other

In an "encoder" attention block just delete the single line that does masking with tril, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
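The difference between the two block types comes down to the mask. The sketch below uses NumPy's `np.tril` in place of the `tril` call mentioned above; the helper `attention_weights` and its `causal` flag are names invented for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention_weights(Q, K, causal):
    T = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # Lower-triangular mask: token t may only look at tokens 0..t.
        mask = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(mask, scores, -np.inf)  # masked scores become weight 0
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
T, head_size = 5, 8
Q = rng.normal(size=(T, head_size))
K = rng.normal(size=(T, head_size))

encoder_w = attention_weights(Q, K, causal=False)  # every token sees every token
decoder_w = attention_weights(Q, K, causal=True)   # no attention to future tokens

print(np.round(decoder_w, 2))
```

In the "decoder" case the weight matrix is lower triangular, so the first token can only attend to itself.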

"self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)

"Scaled" attention additional divides wei by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below