Reversing Transformer

Creator
Seonglae Cho
Created
2024 Apr 17 13:56
Edited
2024 Nov 20 0:08

A transformer starts with a token embedding, followed by a series of residual blocks, and finally a token unembedding.

Both the attention and MLP layers each “read” their input from the residual stream (by performing a linear projection), and then “write” their result to the residual stream by adding a linear projection back in.
  • Transformers have an enormous amount of linear structure.
  • One-layer attention-only transformers are an ensemble of bigram and “skip-trigram” models
  • Two-layer attention-only transformers can implement much more complex algorithms using compositions of attention heads
When there are many equivalent ways to represent the same computation, it is likely that the most human-interpretable representation and the most computationally efficient representation will be different. Composition of attention heads is the key difference between one-layer and two-layer attention-only transformers.
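A minimal sketch of this read/write pattern, assuming toy shapes and a placeholder nonlinearity in place of a real attention or MLP computation (no layer norm):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, seq_len = 64, 16, 8

# The residual stream: one d_model-dimensional vector per token position.
resid = rng.normal(size=(seq_len, d_model))

# A layer "reads" its input via a linear projection of the residual stream...
W_read = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
hidden = np.maximum(resid @ W_read, 0.0)  # placeholder for the layer's internal computation

# ...and "writes" its result back by adding a linear projection into the stream.
W_write = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
resid = resid + hidden @ W_write          # purely additive updates: lots of linear structure

print(resid.shape)  # (8, 64): same stream, with this layer's output added in
```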
  • logit
  • token vector
  • embedding matrix
  • MLP output
  • Attention head
  • token unembedding

QK and OV matrices (within a single head)

Attention heads can be understood as having two largely independent computations.
The OV and QK matrices are extremely low-rank. Copying behavior is widespread in OV matrices and is arguably one of the most interesting behaviors (it underlies shifting and induction heads). One-layer transformer models represent skip-trigrams in a "factored form" split between the OV and QK matrices. It's kind of like representing a function f(a, b, c) = f1(a, b) · f2(a, c). They can't really capture the three-way interactions flexibly.
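A sketch with random weights (row-vector convention, direct embedding path only, no positions or layer norm) of the two factored matrices and the vocabulary-level circuits they induce in a one-layer model; the matrix names mirror the paper's but the values are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, d_vocab = 64, 8, 100

W_E = rng.normal(size=(d_vocab, d_model))  # token embedding (rows = tokens)
W_U = rng.normal(size=(d_model, d_vocab))  # token unembedding
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))

# The two factored matrices of a single head, each of rank at most d_head:
W_QK = W_Q @ W_K.T   # scores (destination, source) pairs
W_OV = W_V @ W_O     # maps an attended source vector to an output direction
print(np.linalg.matrix_rank(W_QK), np.linalg.matrix_rank(W_OV))  # 8 8

# Vocabulary-level circuits of a one-layer model (direct embedding path only):
qk_scores = W_E @ W_QK @ W_E.T   # f1(destination, source): how much dst attends to src
ov_logits = W_E @ W_OV @ W_U     # f2(source, out): how attending to src changes out's logit

# A skip-trigram [source] ... [destination] -> [out] is stored only in this
# factored form, so three-way interactions cannot be captured flexibly.
src, dst, out = 3, 7, 3
print(qk_scores[dst, src], ov_logits[src, out])
```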
The point to understand about this circuit is that the tokens involved split into a source and a destination, as follows.
Previous Token Head (source attention) → Induction head (destination attention)
The attention pattern is a function of both the source and destination token, but once a destination token has decided how much to attend to a source token, the effect on the output is solely a function of that source token.

1. QK Circuit

How each attention head's attention pattern is computed (for an induction head, matching where the same pattern occurred before)
  • preceding tokens → attended token
In fact, information about the attended token itself is quite irrelevant to calculating the attention pattern for induction. Note that the attended token is only ignored when calculating the attention pattern through the QK-circuit; the attended token is extremely important for calculating the head's output through the OV-circuit! (The parts of the head that calculate the attention pattern and the output if attended to are separable and are often useful to consider independently.)
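A toy sketch of that separability (random weights, causal masking, no layer norm or biases): the attention pattern is computed entirely from the QK side, and the per-source contribution to the output entirely from the OV side:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d_model, d_head, seq_len = 32, 8, 6
x = rng.normal(size=(seq_len, d_model))  # residual-stream inputs to the head

W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))

# QK-circuit: where to attend (a function of destination and source inputs only).
scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)          # indexed [destination, source]
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
pattern = softmax(np.where(causal, scores, -np.inf), axis=-1)

# OV-circuit: what each source would write if attended to (a function of the source alone).
per_source_write = x @ W_V @ W_O                            # (seq_len, d_model)
head_output = pattern @ per_source_write                    # mixed by the attention pattern

print(pattern.shape, head_output.shape)                     # (6, 6) (6, 32)
```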

2. OV Circuit

Copying is done by the OV ("Output-Value") circuit. 
Transformers seem to have quite a number of copying heads (attention heads whose OV circuit copies), of which induction heads are a subset.
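A toy construction of a copying head, assuming a tied unembedding (W_U = W_E^T) and an OV matrix that is a rank-d_head projection; the eigenvalue summary below is in the spirit of the paper's copying statistic, not its exact code:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_head, d_vocab = 32, 8, 50

W_E = rng.normal(size=(d_vocab, d_model))  # embedding (rows = tokens)
W_U = W_E.T                                # tied unembedding (assumption for the toy)

# A "copying" head: its OV matrix is a rank-d_head orthogonal projection, so it
# passes (part of) the attended token's embedding straight through to the output.
U, _ = np.linalg.qr(rng.normal(size=(d_model, d_head)))
W_OV = U @ U.T                             # stands in for W_V @ W_O

# Full OV circuit on the vocabulary: effect of attending to token j on token i's logit.
full_OV = W_E @ W_OV @ W_U                 # (d_vocab, d_vocab)

eig = np.linalg.eigvals(full_OV).real
copying_score = eig.clip(min=0).sum() / np.abs(eig).sum()
print(round(float(copying_score), 3))      # ~1.0: overwhelmingly positive eigenvalues = copying
```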

Path Expansion Trick for Multi-layer Attention with composition

The most basic form of an induction head uses pure K-composition with an earlier “previous token head” to create a QK-circuit term of the form W_E^T W_QK^{h2} W_OV^{h1} W_E (h1 the previous token head, h2 the induction head), where this matrix has positive eigenvalues. This term causes the induction head to compare the current token with every earlier position's preceding token and look for places where they're similar. More complex QK circuit terms can be used to create induction heads which match on more than just the preceding token.
Although it is not clearly stated in the paper, whether in a single layer of this specific form, or in the multi-layer case where the residual stream (the latent space) is altered by the token embedding or by Q-/K-composition, an induction head with a similar eigenvector increases the probability of that token in the output distribution.
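A toy construction (hand-built subspaces, random embeddings; not the paper's code) of the composed K-composition term, showing the positive eigenvalues and diagonal dominance that make the head attend where the preceding token matches the current token:

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, d_head, d_vocab = 32, 8, 40

W_E = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)  # embedding (rows = tokens)

# Layer-1 previous token head: its OV matrix writes (a rank-d_head projection of)
# the previous token's embedding into the residual stream at each position.
U, _ = np.linalg.qr(rng.normal(size=(d_model, d_head)))
W_OV_prev = U @ U.T

# Layer-2 induction head uses K-composition: its keys read exactly that subspace,
# while its queries read the current token's embedding directly.
W_QK_ind = U @ U.T

# Composed vocabulary-level QK term (W_E^T W_QK^{h2} W_OV^{h1} W_E in the paper's
# notation, written here in row-vector convention).
term = W_E @ W_QK_ind @ W_OV_prev.T @ W_E.T   # (d_vocab, d_vocab)

eig = np.linalg.eigvals(term).real
pos_mass = eig.clip(min=0).sum() / np.abs(eig).sum()
diag = float(np.diag(term).mean())
offdiag = float((term.sum() - np.trace(term)) / (term.size - d_vocab))
print(round(float(pos_mass), 3))          # ~1.0: positive eigenvalues
print(round(diag, 3), round(offdiag, 3))  # diagonal dominates: the score is high exactly
                                          # when the current token equals the earlier
                                          # position's preceding token
```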

Token Definitions

The QK circuit determines which "source" token the present "destination" token attends back to and copies information from, while the OV circuit describes what the resulting effect on the "out" predictions for the next token is.
[source]... [destination][out]
  • preceding tokens - the attention pattern is a function of all possible source tokens from the start of the sequence up to the destination token.
  • source token - the attended token, a specific previous token which the induction head attended to. The attended token needs to contain information about the tokens preceding it, from which information is read.
  • destination token - the current token, where information is written
  • output token - the predicted token, similar to the source token, that comes after the destination token

Composition

  • One-layer model copying head: [b] … [a] → [b]
    • And when rare quirks of tokenization allow: [ab] … [a] → [b]
  • Two-layer model induction head: [a][b] … [a] → [b]
For the next-layer QK-circuit, both Q-composition and K-composition come into play, with previous-layer attention heads potentially influencing the construction of the keys and queries.
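A sketch of the path expansion trick for a single layer-1 head feeding a layer-2 head's scores (fixed toy attention pattern, random weights, no layer norm): expanding the layer-2 query-key product over the residual-stream paths yields a direct token term, a Q-composition term, a K-composition term, and a term where the layer-1 head feeds both sides:

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, d_head, seq_len = 16, 4, 5

x0 = rng.normal(size=(seq_len, d_model))   # token embeddings (the direct path)

# One layer-1 head with a fixed toy attention pattern A1 and OV matrix W_OV1.
A1 = np.tril(np.ones((seq_len, seq_len)))
A1 /= A1.sum(axis=-1, keepdims=True)
W_OV1 = rng.normal(size=(d_model, d_model)) * 0.1
x1 = x0 + A1 @ x0 @ W_OV1                  # residual stream entering layer 2

W_Q2 = rng.normal(size=(d_model, d_head))
W_K2 = rng.normal(size=(d_model, d_head))

# Layer-2 attention scores, then the same scores expanded into path terms.
scores = (x1 @ W_Q2) @ (x1 @ W_K2).T

direct = (x0 @ W_Q2) @ (x0 @ W_K2).T                            # tokens feed queries and keys directly
q_comp = (A1 @ x0 @ W_OV1 @ W_Q2) @ (x0 @ W_K2).T               # Q-composition: h1 output feeds queries
k_comp = (x0 @ W_Q2) @ (A1 @ x0 @ W_OV1 @ W_K2).T               # K-composition: h1 output feeds keys
both   = (A1 @ x0 @ W_OV1 @ W_Q2) @ (A1 @ x0 @ W_OV1 @ W_K2).T  # h1 feeds both sides

print(np.allclose(scores, direct + q_comp + k_comp + both))     # True: the expansion is exact
```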
 
 
 
 

Attention-only transformers, which don't have MLP layers

Because the authors had much less success in understanding MLP layers at the time (2021)
 
 
