Attention head

Attention heads move information from the residual stream of one token to another.

Attention head’s mathematical representation

weights for a specific head $h$ $W^h_{QK}$

Output of attention layer is sum multiplication of each heads’ result vector and weight $W_O^H = [W_O^{h_1}, W_O^{h_2}, \dots] \cdot \begin {bmatrix} r^{h_1} \\ r^{h_2} \\ \cdots \end {bmatrix} = \sum_i W_O^{h_i} r^{h_i}$

k_i = W_Kx_i, v_i = W_Vx_i, r_i = \sum_j A_{i, j} v_j, h(x)_i = W_O r_i

Attention Score $A_{i, j} = softmax(x^T W_Q^T W_Kx)$

Because $W_OW_V$ and $W_QW_K$ always operate together, we like to define variables representing these combined matrices, $W_{OV} = W_OW_V$ and $W_{QK} = W^T_QW^K$ .

Tensor Product representation

One-Layer Attention-Only Transformers example (without positional info)

A^h \otimes (W_U W_{OV}^hW_E), A^h = softmax^*(t^T \cdot W^T_EW^h_{QK}W_E \cdot t)

Attention head Properties

Attention heads can be understood as independent operations, each outputting a result which is added into the
Residual Stream

Attention Heads are Independent and Additive

Attention Heads as Information Movement

Every attention head reads in subspaces of the residual stream determined by

W_Q, W_K, W_V

and then writes to some subspace determined by

W_O

QK, OV matrix (within single head)

Attention heads can be understood as having two largely independent computations.

The OV and QK matrices are extremely low-rank. Copying behavior is widespread in OV matrices and arguably one of the most interesting behaviors. (for shifting and induction head). One-layer transformer models represent skip-trigrams in a "factored form" split between the OV and QK matrices. It's kind of like representing a function

f(a,b,c) = f_1(a,b) f_2(a,c)

. They can't really capture the three way interactions flexibly.

The point to understand about the Circuit is that the tokens are made up of a source and a destination, as follows.

Previous Token Head (source attention) → Induction head (destination attention)

The attention pattern is a function of both the source and destination token, but once a destination token has decided how much to attend to a source token, the effect on the output is solely a function of that source token.

1. QK Circuit

How each attention head's attention pattern is computed (same pattern matching)

preceding tokens → attended token

In fact, information about the attended token itself is quite irrelevant to calculating the attention pattern for induction. Note that the attended token is only ignored when calculating the attention pattern through the QK-circuit. Attended token is extremely important for calculating the head’s output through the OV-circuit! (The parts of the head that calculate the attention pattern, and the output if attended to, are separable and are often useful to consider independently)

2. OV Circuit

Copying is done by the OV ("Output-Value") circuit.

Transformers seem to have quite a number of copying head (

Attention head), of which induction heads are a subset.

Path Expansion Trick for Multi-layer Attention with composition

More complex QK circuit terms can be used to create induction heads which match on more than just the preceding token. The most basic form of an induction head uses pure K-composition with an earlier “previous token head” to create a QK-Circuit term of the form

Id ⊗h_{prev} ⊗ W

where

W

has positive

Eigenvalues. This term causes the induction head to compare the current token with every earlier position's preceding token and look for places where they're similar. More complex QK circuit terms can be used to create induction heads which match on more than just the preceding token.

Although it is not clearly stated in the paper, in the case of a specific form of single layer, or in the case of multi-layer where the latent space residual stream is altered by token embedding or Q,K-composition, the induction head with a similar

Eigenvector increases the token distribution probability.

Token Definitions

The QK circuit determines which "source" token the present "destination" token attends back to and copies information from, while the OV circuit describes what the resulting effect on the "out" predictions for the next token is.

[source]... [destination][out]

preceding tokens - attention pattern is a function of all possible source tokens from the start to the destination token.

source token - attended token is a specific previous token which induction head attended to. Attended token needs to contain information about the preceding tokens from what information is read.

destination token - current token where information is written

output token - predicted token which are similar with source token after destination token

Composition

One layer model copying head: [b] … [a] → [b]

And when rare quirks of tokenization allow: [ab] … [a] → [b]

Two layer model induction head: [a][b] … [a] → [b]

For the next layer QK-circuit, both Q-composition and K-composition come into play, with previous layer attention heads potentially influencing the construction of the keys and queries

Compositions of Attention

Q-Composition: $W_Q$ reads in a subspace affected by a previous token head.

K-Composition: $W_K$ reads in a subspace affected by a previous token head. (key shifting)

Induction heads K reads from a subspace written to by an earlier attention head

V-Composition: $W_V$ reads in a subspace affected by a previous token head.

Q- and K-Composition are quite different from V-Composition. Q and K-Composition both affect the attention pattern, allowing attention heads to express much more complex patterns. V-Composition, on the other hand, affects what information an attention head moves when it attends to a given position; the result is that V-composed heads really act more like a single unit and can be thought of as creating an additional "virtual attention heads" like

Superposition Hypothesis. The virtual attention heads has much larger effect when the depth of layer increases.

Larger models have more heads, which gives them more capacity for other interesting Q-composition and K-composition mechanisms that small models can’t afford to express. If all “composition heads” form simultaneously during the phase change, then it’s possible that above some size, non-induction composition heads could together account for more of the phase change and in-context learning improvement than induction heads do.

(Literal) Copying head

Does the head’s direct effect on the residual stream increase the logits of the same token as the one being attended to?

The only other potential contender for driving in-context learning in two-layer attention only models would be basic copying heads. However, basic copying heads also exist in one-layer models, which don't have the greatly increased in-context learning we see in two-layer models. Further, induction heads just seem conceptually more powerful.

Transformers seem to have quite a number of copying head, of which induction heads are a subset. This is done by having a "copying matrix" OV circuit, most easily characterized by its positive eigenvalues.

In larger models we often observe attention heads which "copy" some mixture of gender, plurality, and tense from nearby words, helping the model use the correct pronouns and conjugate verbs. So copying is actually a more complex concept than it might first appear.

One natural approach might be to use eigenvectors and eigenvalues. Let's consider what that means for an OV circuit

M = W_UW_{OV}^h W_E

\lambda_i

(for

Mv_i = \lambda_i v_i

) is a positive real number. Then we're saying that there's a linear combination of tokens which increases the linear combination of logits of those same tokens. Very roughly you could think of this as a set of tokens which mutually increase their own probability.

Copying requires positive eigenvalues, and indeed we observe that many attention heads have positive eigenvalues, apparently mirroring the copying structure:

It appears that 10 out of 12 heads are significantly copying! (This agrees with qualitative inspection of the expanded weights.) But while copying matrices must have positive eigenvalues, it isn't clear that all matrices with positive eigenvalues are things we necessarily want to consider to be copying. A matrix's eigenvectors aren't necessarily orthogonal, and this allows for pathological examples.

Prefix-matching head

On repeated sequences of random tokens, does the head attend to earlier tokens that are followed by a token that matches the present token?

Previous token attention

Does the head attend to the token that immediately precedes the present token?

Virtual attention heads

Products of attention heads behave much like attention heads themselves. By the distributive property or

Tensor Product. The result of this product can be seen as functionally equivalent to an attention head, with an attention pattern which is the composition of the two heads. We call these “virtual attention heads”. The virtual attention heads has much larger effect when the depth of layer increases.

Notation

https://transformer-circuits.pub/2021/framework/index.html#variable-definitions

Copy Suppression

Copy Suppression Preserving Ablation (CSPA) proved that 76.9% is attributed to copy suppression

arxiv.org

https://arxiv.org/pdf/2310.04625