Multi-token transcoder

Creator: Seonglae Cho
Created: 2025 Oct 29 23:56
Edited: 2025 Oct 30 0:17
Multi-Token Transcoder (MTC) is a multi-token-aware transcoder that approximates the entire attention layer as a function and simultaneously learns "what moves where".
  • Token-wise Encoder
  • Feature-specific Attention Operation
  • Destination-side Decoder
Attention requires the QK circuit ("where information is moved from/to") and the OV circuit ("what is moved") to work together. MTC models both simultaneously, so it jointly learns what is moved and where it is moved to. Traditional SAEs capture only "what is moved" (i.e., they learn only the OV condition), and the cross-layer transcoder (CLT) likewise cannot handle attention.
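
A minimal PyTorch sketch of how these three components could fit together is below. The module names, shapes, and the per-feature mixing of the model's own attention patterns are assumptions for illustration, not the actual MTC architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenTranscoder(nn.Module):
    """Sketch of an MTC: sparse features are encoded per source token (OV, "what"),
    moved to destination tokens by a feature-specific attention operation (QK, "where"),
    and decoded on the destination side to approximate the attention layer's output."""

    def __init__(self, d_model: int, n_features: int, n_heads: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # token-wise encoder
        # Per-feature mixing weights over the model's heads: which heads move this feature.
        self.head_loading = nn.Parameter(torch.randn(n_features, n_heads) / n_heads ** 0.5)
        self.decoder = nn.Linear(n_features, d_model)    # destination-side decoder

    def forward(self, resid_pre: torch.Tensor, attn_pattern: torch.Tensor) -> torch.Tensor:
        # resid_pre:    (batch, seq, d_model)           residual stream entering the attention layer
        # attn_pattern: (batch, n_heads, dst, src)      the model's own attention pattern
        f_src = F.relu(self.encoder(resid_pre))          # (batch, src, n_features)
        # Feature-specific attention pattern: loading-weighted mixture of the heads' patterns.
        feat_pattern = torch.einsum('fh,bhds->bfds', self.head_loading, attn_pattern)
        # Move each feature's activation from source to destination positions.
        f_dst = torch.einsum('bfds,bsf->bdf', feat_pattern, f_src)   # (batch, dst, n_features)
        return self.decoder(f_dst)                       # approximate attention-layer output
```

Training would regress this reconstruction against the real attention-layer output (MSE) with a sparsity penalty on the feature activations.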

Attention Superposition Analysis

Results show that attention superposition exists: multiple heads partially represent the same feature, providing strong evidence that attention features are distributed across multiple heads rather than localized to single heads.
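
One way such superposition could be quantified from a trained MTC is sketched below, assuming per-feature head loadings like the hypothetical `head_loading` parameter above (random values stand in for learned weights).

```python
import torch

# Stand-in for learned per-feature head loadings (n_features, n_heads).
head_loading = torch.randn(4096, 12).abs()

# Share of each feature's loading carried by each head.
shares = head_loading / head_loading.sum(dim=-1, keepdim=True)

# A feature is "superposed" across heads if no single head dominates it,
# e.g. its top head carries less than 90% of the total loading.
top_share = shares.max(dim=-1).values
print(f"fraction of features spread over multiple heads: {(top_share < 0.9).float().mean():.2%}")

# Effective number of heads per feature (exponentiated entropy of the shares).
entropy = -(shares * (shares + 1e-9).log()).sum(dim=-1)
print("mean effective heads per feature:", entropy.exp().mean().item())
```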

QK/OV Coupling

Semantic features emerge that correspond to specific clusters of attention heads, yielding interpretable features, and QK/OV coupling is confirmed. Interpretations are clearer, and performance is slightly better, than with SAE + attention-head combinations.
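
One way to make "QK/OV coupling" concrete is to compare, per feature, a QK-side head attribution with an OV-side head attribution. The sketch below uses random stand-in tensors and is a hypothetical analysis, not the paper's actual measurement.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-feature head attributions (n_features, n_heads):
# qk_loading - how much each head's attention pattern routes the feature ("where")
# ov_loading - how much each head's OV circuit carries the feature's content ("what")
n_features, n_heads = 4096, 12
qk_loading = torch.randn(n_features, n_heads).abs()
ov_loading = torch.randn(n_features, n_heads).abs()

# QK/OV coupling: per feature, alignment of the two head-attribution vectors.
coupling = F.cosine_similarity(qk_loading, ov_loading, dim=-1)
print("mean QK/OV coupling:", coupling.mean().item())
print("fraction strongly coupled (>0.8):", (coupling > 0.8).float().mean().item())
```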

Multidimensional OV

Multidimensional OV structure (output-space diversity) and the clustering of head-loading vectors follow a power-law distribution, suggesting a continuous spectrum of information movement rather than discrete "dimensions". The main drawback is the explosive memory and computation cost.
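
A minimal sketch of one way to probe this, assuming the per-feature OV output directions are available as decoder rows (`W_dec` here is a random stand-in): fit a power law to the singular-value spectrum in log-log space.

```python
import torch

# Stand-in for per-feature OV output directions (n_features, d_model),
# e.g. a trained MTC's decoder rows.
W_dec = torch.randn(4096, 768)

# Singular-value spectrum of the feature output space: a power-law decay
# (a straight line in log-log) suggests a continuous spectrum of information
# movement rather than a few discrete OV "dimensions".
sv = torch.linalg.svdvals(W_dec)
ranks = torch.arange(1, len(sv) + 1, dtype=torch.float32)

# Least-squares fit of log(sv) ~ alpha * log(rank) + c gives the power-law exponent.
X = torch.stack([ranks.log(), torch.ones_like(ranks)], dim=-1)
alpha, c = torch.linalg.lstsq(X, sv.log().unsqueeze(-1)).solution.squeeze(-1)
print(f"fitted power-law exponent: {alpha.item():.2f}")
```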

QK Diagonalization

A method that directly explains attention pattern formation, i.e., why the model attends to specific positions. Query (Q) and key (K) vectors are each passed through an encoder to produce feature-level representations, and the query side and key side of feature i are constrained to interact only 1:1 (diagonalization). Rank-1 QK features (vectors) yield low-dimensional attention patterns, while high-rank QK features (matrices) yield complex patterns. Because the rank-1 model struggles to reproduce the base attention exactly, increasing the number of features does not improve MSE; higher rank is needed, so the next step is to experiment with high-rank QK features.
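
A minimal sketch of the diagonal constraint, under the assumption that the attention logit is approximated by a diagonal bilinear form over encoded query/key features (names and shapes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagonalQK(nn.Module):
    """Sketch of QK diagonalization: queries and keys are each encoded into a shared
    feature space, and the attention logit is the sum over features of the query-side
    and key-side activations of the *same* feature index (one rank-1 QK feature per i),
    i.e. a diagonal bilinear form instead of a full QK matrix."""

    def __init__(self, d_head: int, n_features: int):
        super().__init__()
        self.q_enc = nn.Linear(d_head, n_features)  # query-side feature encoder
        self.k_enc = nn.Linear(d_head, n_features)  # key-side feature encoder

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q: (batch, dst, d_head), k: (batch, src, d_head)
        f_q = F.relu(self.q_enc(q))                 # (batch, dst, n_features)
        f_k = F.relu(self.k_enc(k))                 # (batch, src, n_features)
        # Diagonal interaction: feature i on the query side only meets feature i on the key side.
        logits = torch.einsum('bdf,bsf->bds', f_q, f_k)
        return logits.softmax(dim=-1)               # reconstructed attention pattern

# Trained by matching the base head's attention pattern (e.g. KL or MSE). Because each
# QK feature is rank-1, adding more features does not by itself capture higher-rank
# patterns, which motivates the high-rank QK feature variant mentioned above.
```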

Attention Value SAE

Concatenate the value vectors from multiple heads and train an SAE on them to analyze how attention information is shared between layers. Cross-layer representations are discovered: training simultaneously on the value vectors of layers 4 and 10, a single SAE covering both layers shows better MSE/L₀ performance.
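
A minimal sketch, assuming value vectors from all heads of the two layers are cached and concatenated per token; the dimensions and the L1 penalty (as a proxy for low L₀) are illustrative stand-ins for the actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueSAE(nn.Module):
    """Sketch of an attention-value SAE trained on the concatenated value vectors of
    all heads from two layers (e.g. layers 4 and 10), so one dictionary can capture
    cross-layer value representations."""

    def __init__(self, d_head: int, n_heads: int, n_layers: int, n_features: int):
        super().__init__()
        d_in = d_head * n_heads * n_layers          # heads concatenated across both layers
        self.encoder = nn.Linear(d_in, n_features)
        self.decoder = nn.Linear(n_features, d_in)

    def forward(self, v_cat: torch.Tensor):
        # v_cat: (batch * seq, d_head * n_heads * n_layers) concatenated value vectors
        f = F.relu(self.encoder(v_cat))
        return self.decoder(f), f

# Hypothetical training step: reconstruction MSE plus an L1 sparsity penalty.
sae = ValueSAE(d_head=64, n_heads=12, n_layers=2, n_features=8192)
v_cat = torch.randn(32, 64 * 12 * 2)                # stand-in for cached value vectors
recon, f = sae(v_cat)
loss = F.mse_loss(recon, v_cat) + 1e-3 * f.abs().mean()
loss.backward()
```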