Multi-token transcoder

Creator: Seonglae Cho
Created: 2025 Oct 29 23:56
Edited: 2025 Oct 30 0:17
Multi-Token Transcoder (MTC) is a multi-token-aware transcoder that approximates the entire attention layer as a function and simultaneously learns "what moves where".
  • Token-wise Encoder
  • Feature-specific Attention Operation
  • Destination-side Decoder
Attention requires the QK circuit ("where information is moved from/to") and the OV circuit ("what is moved") to work together. MTC models both simultaneously, so it jointly learns what is moved and where it is moved to. Traditional SAEs capture only "what is moved" (i.e., they learn only the OV condition), and the cross-layer transcoder (CLT) likewise cannot handle attention.
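
A minimal PyTorch sketch of how these three components could fit together is below. The module names, shapes, and the per-feature mixing of the model's own attention patterns are assumptions for illustration, not the actual MTC architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenTranscoder(nn.Module):
    """Sketch of an MTC: sparse features are encoded per source token (OV, "what"),
    moved to destination tokens by a feature-specific attention operation (QK, "where"),
    and decoded on the destination side to approximate the attention layer's output."""

    def __init__(self, d_model: int, n_features: int, n_heads: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # token-wise encoder
        # Per-feature mixing weights over the model's heads: which heads move this feature.
        self.head_loading = nn.Parameter(torch.randn(n_features, n_heads) / n_heads ** 0.5)
        self.decoder = nn.Linear(n_features, d_model)    # destination-side decoder

    def forward(self, resid_pre: torch.Tensor, attn_pattern: torch.Tensor) -> torch.Tensor:
        # resid_pre:    (batch, seq, d_model)           residual stream entering the attention layer
        # attn_pattern: (batch, n_heads, dst, src)      the model's own attention pattern
        f_src = F.relu(self.encoder(resid_pre))          # (batch, src, n_features)
        # Feature-specific attention pattern: loading-weighted mixture of the heads' patterns.
        feat_pattern = torch.einsum('fh,bhds->bfds', self.head_loading, attn_pattern)
        # Move each feature's activation from source to destination positions.
        f_dst = torch.einsum('bfds,bsf->bdf', feat_pattern, f_src)   # (batch, dst, n_features)
        return self.decoder(f_dst)                       # approximate attention-layer output
```

Training would regress this reconstruction against the real attention-layer output (MSE) with a sparsity penalty on the feature activations.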

Attention Superposition Analysis

Results show that attention superposition exists: multiple heads partially represent the same feature, providing strong evidence that attention features are distributed across multiple heads rather than localized to single heads.
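
One way such superposition could be quantified from a trained MTC is sketched below, assuming per-feature head loadings like the hypothetical `head_loading` parameter above (random values stand in for learned weights).

```python
import torch

# Stand-in for learned per-feature head loadings (n_features, n_heads).
head_loading = torch.randn(4096, 12).abs()

# Share of each feature's loading carried by each head.
shares = head_loading / head_loading.sum(dim=-1, keepdim=True)

# A feature is "superposed" across heads if no single head dominates it,
# e.g. its top head carries less than 90% of the total loading.
top_share = shares.max(dim=-1).values
print(f"fraction of features spread over multiple heads: {(top_share < 0.9).float().mean():.2%}")

# Effective number of heads per feature (exponentiated entropy of the shares).
entropy = -(shares * (shares + 1e-9).log()).sum(dim=-1)
print("mean effective heads per feature:", entropy.exp().mean().item())
```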

QK/OV Coupling

Semantic features emerge that correspond to specific clusters of attention heads, yielding interpretable features, and QK/OV coupling is confirmed. Interpretations are clearer, and performance is slightly better, than with SAE + attention-head combinations.
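
One way to make "QK/OV coupling" concrete is to compare, per feature, a QK-side head attribution with an OV-side head attribution. The sketch below uses random stand-in tensors and is a hypothetical analysis, not the paper's actual measurement.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-feature head attributions (n_features, n_heads):
# qk_loading - how much each head's attention pattern routes the feature ("where")
# ov_loading - how much each head's OV circuit carries the feature's content ("what")
n_features, n_heads = 4096, 12
qk_loading = torch.randn(n_features, n_heads).abs()
ov_loading = torch.randn(n_features, n_heads).abs()

# QK/OV coupling: per feature, alignment of the two head-attribution vectors.
coupling = F.cosine_similarity(qk_loading, ov_loading, dim=-1)
print("mean QK/OV coupling:", coupling.mean().item())
print("fraction strongly coupled (>0.8):", (coupling > 0.8).float().mean().item())
```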

Multidimensional OV

Multidimensional OV structure (output-space diversity) and the clustering of head-loading vectors follow a power-law distribution, suggesting a continuous spectrum of information movement rather than discrete "dimensions". The main drawback is the explosive memory and computation cost.
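
A minimal sketch of one way to probe this, assuming the per-feature OV output directions are available as decoder rows (`W_dec` here is a random stand-in): fit a power law to the singular-value spectrum in log-log space.

```python
import torch

# Stand-in for per-feature OV output directions (n_features, d_model),
# e.g. a trained MTC's decoder rows.
W_dec = torch.randn(4096, 768)

# Singular-value spectrum of the feature output space: a power-law decay
# (a straight line in log-log) suggests a continuous spectrum of information
# movement rather than a few discrete OV "dimensions".
sv = torch.linalg.svdvals(W_dec)
ranks = torch.arange(1, len(sv) + 1, dtype=torch.float32)

# Least-squares fit of log(sv) ~ alpha * log(rank) + c gives the power-law exponent.
X = torch.stack([ranks.log(), torch.ones_like(ranks)], dim=-1)
alpha, c = torch.linalg.lstsq(X, sv.log().unsqueeze(-1)).solution.squeeze(-1)
print(f"fitted power-law exponent: {alpha.item():.2f}")
```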

QK Diagonalization

A method that directly explains attention pattern formation, i.e., why the model attends to specific positions. Query (Q) and key (K) vectors are each passed through an encoder to produce feature-level representations, and the query side and key side of feature i are constrained to interact only 1:1 (diagonalization). Rank-1 QK features (vectors) yield low-dimensional attention patterns, while high-rank QK features (matrices) yield complex patterns. Because the rank-1 model struggles to reproduce the base attention exactly, increasing the number of features does not improve MSE; higher rank is needed, so the next step is to experiment with high-rank QK features.
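
A minimal sketch of the diagonal constraint, under the assumption that the attention logit is approximated by a diagonal bilinear form over encoded query/key features (names and shapes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagonalQK(nn.Module):
    """Sketch of QK diagonalization: queries and keys are each encoded into a shared
    feature space, and the attention logit is the sum over features of the query-side
    and key-side activations of the *same* feature index (one rank-1 QK feature per i),
    i.e. a diagonal bilinear form instead of a full QK matrix."""

    def __init__(self, d_head: int, n_features: int):
        super().__init__()
        self.q_enc = nn.Linear(d_head, n_features)  # query-side feature encoder
        self.k_enc = nn.Linear(d_head, n_features)  # key-side feature encoder

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q: (batch, dst, d_head), k: (batch, src, d_head)
        f_q = F.relu(self.q_enc(q))                 # (batch, dst, n_features)
        f_k = F.relu(self.k_enc(k))                 # (batch, src, n_features)
        # Diagonal interaction: feature i on the query side only meets feature i on the key side.
        logits = torch.einsum('bdf,bsf->bds', f_q, f_k)
        return logits.softmax(dim=-1)               # reconstructed attention pattern

# Trained by matching the base head's attention pattern (e.g. KL or MSE). Because each
# QK feature is rank-1, adding more features does not by itself capture higher-rank
# patterns, which motivates the high-rank QK feature variant mentioned above.
```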

Attention Value SAE

Concatenate the value vectors from multiple heads and train an SAE on them to analyze how attention information is shared between layers. Cross-layer representations are discovered: training simultaneously on the value vectors of layers 4 and 10, a single SAE covering both layers shows better MSE/L₀ performance.
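
A minimal sketch, assuming value vectors from all heads of the two layers are cached and concatenated per token; the dimensions and the L1 penalty (as a proxy for low L₀) are illustrative stand-ins for the actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueSAE(nn.Module):
    """Sketch of an attention-value SAE trained on the concatenated value vectors of
    all heads from two layers (e.g. layers 4 and 10), so one dictionary can capture
    cross-layer value representations."""

    def __init__(self, d_head: int, n_heads: int, n_layers: int, n_features: int):
        super().__init__()
        d_in = d_head * n_heads * n_layers          # heads concatenated across both layers
        self.encoder = nn.Linear(d_in, n_features)
        self.decoder = nn.Linear(n_features, d_in)

    def forward(self, v_cat: torch.Tensor):
        # v_cat: (batch * seq, d_head * n_heads * n_layers) concatenated value vectors
        f = F.relu(self.encoder(v_cat))
        return self.decoder(f), f

# Hypothetical training step: reconstruction MSE plus an L1 sparsity penalty.
sae = ValueSAE(d_head=64, n_heads=12, n_layers=2, n_features=8192)
v_cat = torch.randn(32, 64 * 12 * 2)                # stand-in for cached value vectors
recon, f = sae(v_cat)
loss = F.mse_loss(recon, v_cat) + 1e-3 * f.abs().mean()
loss.backward()
```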