Transcoder-Based Attribution Tracing
- CLT compresses amplification chains across multiple MLP layers into a single feature representation, resulting in shorter path lengths in the graph
- CLT globally optimizes MLP outputs across layers (joint training), achieving lower MSE than a per-layer transcoder (PLT) while explaining much more variance than thresholded neurons
Overall Pipeline
- Build a replacement model by substituting the original model's MLP layers with a "Cross-Layer Transcoder (CLT)"
- Visualize the computational flow of the replacement model as an "attribution graph" for specific prompts
- Identify critical paths in the graph
- Verify the causal role of individual features using "patching" techniques
Local Replacement Model
To achieve 100% output matching with the original model for the given prompt:
- Replace every MLP sublayer in the model's transformer blocks with the CLT
- Freeze the attention patterns and layer-norm denominators from the original model
- Add error nodes at each token position and layer to absorb the difference between the CLT reconstruction and the original MLP output (as sketched below)
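A minimal numpy sketch of this construction, using hypothetical shapes and random stand-ins for the recorded activations (not the authors' implementation): the CLT reconstruction plus a per-token, per-layer error node reproduces the original MLP output exactly on the given prompt.

```python
import numpy as np

n_layers, n_tokens, d_model = 4, 8, 16
rng = np.random.default_rng(0)

# Pretend these were recorded from a forward pass of the original model.
original_mlp_out = rng.normal(size=(n_layers, n_tokens, d_model))
# Pretend these are the CLT's reconstructions of the same MLP outputs.
clt_reconstruction = original_mlp_out + 0.1 * rng.normal(size=original_mlp_out.shape)

# Error nodes absorb whatever the CLT fails to explain, per token and layer.
error_nodes = original_mlp_out - clt_reconstruction

# The local replacement model uses (CLT output + error node) in place of each MLP,
# while attention patterns and layer-norm denominators stay frozen from the
# original run, so its outputs match the original model exactly on this prompt.
replacement_mlp_out = clt_reconstruction + error_nodes
assert np.allclose(replacement_mlp_out, original_mlp_out)
```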
1. Attribution Graph (inference-time local graph)
Visualizes causal flows between features. Because an attribution graph contains thousands to tens of thousands of edges, we keep only the paths that contribute most strongly to the model's output (logit).
Nodes
The graph contains embedding nodes, feature nodes, error nodes, and logit nodes. Edges indicate direct contributions, calculated as source node value × (linear) weight. Edge weights fall into two types, residual-direct paths and attention-mediated paths, distinguishing connections through the residual stream from connections through attention OV circuits. In the local graph, the original model's attention patterns (QK) are frozen and the OV (output-value) stage is included, so we can track which token positions, through which features, contributed to which token predictions. This makes it possible to visualize how specific attention heads move information into particular features on a given prompt.
In contrast, the global graph measures only residual-direct paths (CLT decoder → residual stream → CLT encoder), because attention patterns (QK) change with every context; attention-mediated paths are therefore excluded from the global analysis.
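As a rough illustration of the two edge types, the sketch below computes a residual-direct and an attention-mediated edge weight for a single feature pair; the vector names, shapes, and the scalar attention-pattern entry are all assumed for illustration, not taken from the paper.

```python
import numpy as np

d_model = 16
rng = np.random.default_rng(1)

a_src = 1.7                              # source feature activation
w_dec_src = rng.normal(size=d_model)     # source feature's CLT decoder vector
w_enc_tgt = rng.normal(size=d_model)     # target feature's CLT encoder vector

# Residual-direct contribution: activation times the linear weight linking
# the two features through the residual stream.
edge_residual = a_src * (w_dec_src @ w_enc_tgt)

# Attention-mediated contribution: route the decoder output through a head's
# OV matrix, scaled by the frozen attention-pattern entry from the source
# token position to the destination token position.
W_OV = rng.normal(size=(d_model, d_model))   # one head's OV matrix
attn_pattern = 0.42                          # frozen attention weight A[dst_pos, src_pos]
edge_attention = a_src * attn_pattern * (w_dec_src @ W_OV @ w_enc_tgt)

print(edge_residual, edge_attention)
```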
Importance Score
Let $A$ be the adjacency matrix of normalized absolute values containing the "first-order (direct edge)" influences. Paths of length $k$ then contribute $A^k$, so the cumulative contribution of all causal interactions between graph nodes can be modeled with a Neumann series. The final influence matrix, excluding self-influence, is
$B = \sum_{k=1}^{\infty} A^k = (I - A)^{-1} - I$
Attribution Graph Pruning
- Create matrix A by taking absolute values of direct contributions (edges) between nodes (token embeddings, features, error nodes) and normalizing so incoming edges to each node sum to 1.
- Calculate indirect contributions using $B = (I - A)^{-1} - I$, where B contains the summed influence of paths of all lengths.
- Calculate influence scores by taking weighted averages of rows in B connected to logit nodes (e.g. final prediction tokens).
- Perform "pruning" by removing low-importance nodes and edges, typically preserving 80-90% of total influence while reducing nodes to ~1/10th.
Causal Role Verification
- First, run the base model to record MLP outputs at each layer and store the decoder contributions of CLT features.
- In the second forward pass, multiply the selected features' decoder contributions by a coefficient $c$ and replace the affected layers' MLP outputs with "recorded MLP output $+\ \Delta$", where $\Delta = (c-1)\times$ the features' recorded decoder contributions.
- Analyze the resulting logit changes under suppression, removal, and amplification patches (see the sketch after this list).
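A minimal sketch of a single patching intervention at one layer and token, under the $(c-1)$-scaling reading above; all names, shapes, and values are assumed for illustration.

```python
import numpy as np

d_model = 16
rng = np.random.default_rng(3)

recorded_mlp_out = rng.normal(size=d_model)   # recorded at this layer/token in pass 1
a_feat = 2.3                                  # recorded activation of the patched feature
w_dec_feat = rng.normal(size=d_model)         # its decoder vector at this layer
feature_contribution = a_feat * w_dec_feat

def patched_mlp_out(c: float) -> np.ndarray:
    """Recorded MLP output with the feature's decoder contribution rescaled by c."""
    return recorded_mlp_out + (c - 1.0) * feature_contribution

# In the second forward pass this patched output replaces the layer's MLP output;
# downstream layers (with frozen attention / layer norms) propagate the change,
# and the resulting logit shift is compared against the attribution graph's prediction.
for c in (0.0, 0.5, 2.0):   # removal, suppression, amplification
    delta = patched_mlp_out(c) - recorded_mlp_out
    print(c, np.linalg.norm(delta))
```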
Constrained patching modifies CLT decoder contributions only within a specific layer range and keeps subsequent layers fixed; this blocks side effects between later-layer features and linearizes the prediction path. Iterative patching re-runs the entire model (including unmodified layers) after each change; this captures second- and third-order knock-on effects, but the causal paths become more complex and harder to interpret.
2. Global Circuit Discovery (context-independent global graph)
While the local graph shows only the features and attention-mediated paths activated for a specific prompt, the global weight circuit reveals more general circuits, for example addition circuits or how refusals of harmful requests are aggregated. This captures repeated, consistent computation patterns that may not be visible in any single prompt. It also enables "pre-filtering" of key feature pairs, so that targets for local-graph and patching experiments can be narrowed down more efficiently.
Global Weights
Global Weight is a constant calculated as the dot product of feature $i$'s CLT decoder vector and feature $j$'s encoder vector, representing the linear, prompt-independent influence (virtual weight) that feature $i$ has on feature $j$ "across all contexts".
- TWERA (Target-Weighted Expected Residual Attribution)
- ERA (Expected Residual Attribution)
In theory, we would need a matrix whose rows and columns range over the features of all layers combined, but since it is too large, we instead build a TWERA/ERA-filtered submatrix. Also, because all of the model's features are mixed into a single residual stream, large global weights appear even for feature pairs that are never actually activated together. We therefore highlight only the weights between features that are actually co-activated, removing this interference.
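The sketch below illustrates prompt-independent global weights and a simple co-activation filter; it only approximates the idea behind TWERA/ERA (which additionally reweight by observed residual-stream attributions), and all shapes, thresholds, and activation statistics are assumed.

```python
import numpy as np

d_model, n_src, n_tgt, n_prompts = 16, 6, 5, 200
rng = np.random.default_rng(4)

W_dec = rng.normal(size=(n_src, d_model))   # decoder vectors of source features
W_enc = rng.normal(size=(n_tgt, d_model))   # encoder vectors of target features

# Global weight: dot product of source decoder and target encoder vectors,
# i.e. the prompt-independent linear path from feature i to feature j.
global_W = W_dec @ W_enc.T                  # shape (n_src, n_tgt)

# Co-activation filter: keep only pairs that actually fire together,
# suppressing interference between features that never co-occur.
acts_src = rng.random(size=(n_prompts, n_src)) > 0.7   # assumed binary activations
acts_tgt = rng.random(size=(n_prompts, n_tgt)) > 0.7
coact = (acts_src.astype(float).T @ acts_tgt.astype(float)) / n_prompts
filtered_W = np.where(coact > 0.05, global_W, 0.0)
print(np.count_nonzero(filtered_W), "of", global_W.size, "pairs kept")
```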
Evaluation
Measure graph simplicity (path length), completeness, and mechanistic fidelity (perturbation agreement)
- Path Length
- Measure how much the paths from "embedding → logit" rely on short paths by examining the cumulative influence carried by $A^k$ as a function of path length $k$, and compute the average number of steps needed to reach the output (see the sketch after this list)
- Completeness
- The ratio of the summed edge contributions from "feature/embedding → logit" in the pruned graph to the total edge-contribution sum; the closer to 1, the smaller the error nodes (the unexplained portion)
- Replacement Score
- The proportion of paths from "embedding→logit" that use only feature paths without errors
- Mechanistic Fidelity
- Cosine similarity between perturbations: measures how similar the downstream activation-change vectors are when the same features are patched in the same way in the original model vs. the local replacement model
- Normalized MSE: the difference in the magnitude of the changes between the two models, revealing how errors accumulate per layer
- Automated Interpretability
- Sort Eval: show two features (their top-k activating examples) to an LLM and measure how reliably it can tell them apart, as an index of whether the feature visualizations are distinguishable
- Contrastive Eval: present to an LLM the features that differ between two contrasting prompts (which differ in a single detail) and measure the probability that it correctly identifies which prompt each feature came from, as an indicator of how well the features capture the difference
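As a simplified illustration of the graph-based metrics above (not the exact definitions), the sketch below estimates average path length from the influence carried by $A^k$ at each length $k$, and a completeness-style score from the share of logit influence attributable to error nodes; the node roles and sizes are assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
A = np.tril(np.abs(rng.normal(size=(n, n))), k=-1)
A = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-9)   # incoming edges sum to 1

emb_idx = np.arange(0, 3)        # assumed embedding nodes
err_idx = np.arange(10, 14)      # assumed error nodes
logit_idx = np.arange(n - 2, n)  # assumed logit nodes

# Path length: weight each step count k by the influence that paths of
# exactly length k (entries of A^k) carry from embeddings to logits.
max_k, weights = 15, []
Ak = np.eye(n)
for k in range(1, max_k + 1):
    Ak = Ak @ A
    weights.append(Ak[np.ix_(logit_idx, emb_idx)].sum())
weights = np.array(weights)
avg_path_length = (np.arange(1, max_k + 1) * weights).sum() / weights.sum()

# Completeness-style score: share of logit influence not coming from error nodes.
B = np.linalg.inv(np.eye(n) - A) - np.eye(n)
total = B[logit_idx].sum()
from_errors = B[np.ix_(logit_idx, err_idx)].sum()
completeness = 1.0 - from_errors / total
print(avg_path_length, completeness)
```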
Limitation
Uninterpreted attention QK circuits, reconstruction errors ("dark matter"), graph complexity, ignored inactive features
A supernode is a grouping of multiple feature nodes into a single semantic unit, used to simplify the visualization and causal-effect analysis of nodes with similar roles; supernodes are designated manually.