Transcoder-Based Attribution Tracing
- CLT compresses amplification chains across multiple MLP layers into a single feature representation, resulting in shorter path lengths in the graph
- CLT globally optimizes MLP outputs across layers (joint training), achieving lower MSE than a per-layer transcoder (PLT) while explaining much more variance than thresholded neurons
Overall Pipeline
- Build a replacement model by substituting the original model's MLP layers with a "Cross-Layer Transcoder (CLT)"
- Visualize the computational flow of the replacement model as an "attribution graph" for specific prompts
- Identify critical paths in the graph
- Verify the causal role of individual features using "patching" techniques
Local Replacement Model
To achieve 100% output matching with the original model for the given prompt:
- Replace every MLP sublayer in the model's transformer blocks with the CLT
- Freeze the attention patterns and layer-norm denominators from the original model
- Add error nodes at each token position and layer to absorb the difference between the CLT reconstruction and the original MLP output (as sketched below)
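A minimal numpy sketch of this construction, using hypothetical shapes and random stand-ins for the recorded activations (not the authors' implementation): the CLT reconstruction plus a per-token, per-layer error node reproduces the original MLP output exactly on the given prompt.

```python
import numpy as np

n_layers, n_tokens, d_model = 4, 8, 16
rng = np.random.default_rng(0)

# Pretend these were recorded from a forward pass of the original model.
original_mlp_out = rng.normal(size=(n_layers, n_tokens, d_model))
# Pretend these are the CLT's reconstructions of the same MLP outputs.
clt_reconstruction = original_mlp_out + 0.1 * rng.normal(size=original_mlp_out.shape)

# Error nodes absorb whatever the CLT fails to explain, per token and layer.
error_nodes = original_mlp_out - clt_reconstruction

# The local replacement model uses (CLT output + error node) in place of each MLP,
# while attention patterns and layer-norm denominators stay frozen from the
# original run, so its outputs match the original model exactly on this prompt.
replacement_mlp_out = clt_reconstruction + error_nodes
assert np.allclose(replacement_mlp_out, original_mlp_out)
```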
1. Attribution Graph (inference-time local graph)
Visualizes causal flows between features. Because an attribution graph contains thousands to tens of thousands of edges, we keep only the paths that contribute most strongly to the model's output (logit).
Nodes
The graph contains embedding nodes, feature nodes, error nodes, and logit nodes. Edges indicate direct contributions, calculated as source node value × (linear) weight. Edge weights fall into two types, residual-direct paths and attention-mediated paths, distinguishing connections through the residual stream from connections through attention OV circuits. In the local graph, the original model's attention patterns (QK) are frozen and the OV (output-value) stage is included, so we can track which token positions, through which features, contributed to which token predictions. This makes it possible to visualize how specific attention heads move information into particular features on a given prompt.
In contrast, the global graph measures only residual-direct paths (CLT decoder → residual stream → CLT encoder), because attention patterns (QK) change with every context; attention-mediated paths are therefore excluded from the global analysis.
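As a rough illustration of the two edge types, the sketch below computes a residual-direct and an attention-mediated edge weight for a single feature pair; the vector names, shapes, and the scalar attention-pattern entry are all assumed for illustration, not taken from the paper.

```python
import numpy as np

d_model = 16
rng = np.random.default_rng(1)

a_src = 1.7                              # source feature activation
w_dec_src = rng.normal(size=d_model)     # source feature's CLT decoder vector
w_enc_tgt = rng.normal(size=d_model)     # target feature's CLT encoder vector

# Residual-direct contribution: activation times the linear weight linking
# the two features through the residual stream.
edge_residual = a_src * (w_dec_src @ w_enc_tgt)

# Attention-mediated contribution: route the decoder output through a head's
# OV matrix, scaled by the frozen attention-pattern entry from the source
# token position to the destination token position.
W_OV = rng.normal(size=(d_model, d_model))   # one head's OV matrix
attn_pattern = 0.42                          # frozen attention weight A[dst_pos, src_pos]
edge_attention = a_src * attn_pattern * (w_dec_src @ W_OV @ w_enc_tgt)

print(edge_residual, edge_attention)
```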
Importance Score
Let $A$ be the adjacency matrix of normalized absolute values containing the "first-order (direct edge)" influences. Paths of length $k$ then contribute $A^k$, so the cumulative contribution of all causal interactions between graph nodes can be modeled with a Neumann series. The final influence matrix, excluding self-influence, is
$B = \sum_{k=1}^{\infty} A^k = (I - A)^{-1} - I$
Attribution Graph Pruning
- Create matrix A by taking absolute values of direct contributions (edges) between nodes (token embeddings, features, error nodes) and normalizing so incoming edges to each node sum to 1.
- Calculate indirect contributions using $B = (I - A)^{-1} - I$, where B contains the summed influence of paths of all lengths.
- Calculate influence scores by taking weighted averages of rows in B connected to logit nodes (e.g. final prediction tokens).
- Perform "pruning" by removing low-importance nodes and edges, typically preserving 80-90% of total influence while reducing nodes to ~1/10th.
Causal Role Verification
- First, run the base model to record MLP outputs at each layer and store the decoder contributions of CLT features.
- In the second forward pass, multiply the selected features' decoder contributions by a coefficient $c$ and replace the affected layers' MLP outputs with "recorded MLP output $+\ \Delta$", where $\Delta = (c-1)\times$ the features' recorded decoder contributions.
- Analyze the resulting logit changes under suppression, removal, and amplification patches (see the sketch after this list).
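A minimal sketch of a single patching intervention at one layer and token, under the $(c-1)$-scaling reading above; all names, shapes, and values are assumed for illustration.

```python
import numpy as np

d_model = 16
rng = np.random.default_rng(3)

recorded_mlp_out = rng.normal(size=d_model)   # recorded at this layer/token in pass 1
a_feat = 2.3                                  # recorded activation of the patched feature
w_dec_feat = rng.normal(size=d_model)         # its decoder vector at this layer
feature_contribution = a_feat * w_dec_feat

def patched_mlp_out(c: float) -> np.ndarray:
    """Recorded MLP output with the feature's decoder contribution rescaled by c."""
    return recorded_mlp_out + (c - 1.0) * feature_contribution

# In the second forward pass this patched output replaces the layer's MLP output;
# downstream layers (with frozen attention / layer norms) propagate the change,
# and the resulting logit shift is compared against the attribution graph's prediction.
for c in (0.0, 0.5, 2.0):   # removal, suppression, amplification
    delta = patched_mlp_out(c) - recorded_mlp_out
    print(c, np.linalg.norm(delta))
```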
Constrained patching modifies CLT decoder contributions only within a specific layer range and keeps subsequent layers fixed; this blocks side effects between later-layer features and linearizes the prediction path. Iterative patching re-runs the entire model (including unmodified layers) after each change; this captures second- and third-order knock-on effects, but the causal paths become more complex and harder to interpret.
2. Global Circuit Discovery (context-independent global graph)
While the local graph shows only the features and attention-mediated paths activated for a specific prompt, the global weight circuit reveals more general circuits, for example addition circuits or how refusals of harmful requests are aggregated. This captures repeated, consistent computation patterns that may not be visible in any single prompt. It also enables "pre-filtering" of key feature pairs, so that targets for local-graph and patching experiments can be narrowed down more efficiently.
Global Weights
Global Weight is a constant calculated as the dot product of feature $i$'s CLT decoder vector and feature $j$'s encoder vector, representing the linear, prompt-independent influence (virtual weight) that feature $i$ has on feature $j$ "across all contexts".
- TWERA (Target-Weighted Expected Residual Attribution)
- ERA (Expected Residual Attribution)
In theory, we would need a matrix whose rows and columns range over the features of all layers combined, but since it is too large, we instead build a TWERA/ERA-filtered submatrix. Also, because all of the model's features are mixed into a single residual stream, large global weights appear even for feature pairs that are never actually activated together. We therefore highlight only the weights between features that are actually co-activated, removing this interference.
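The sketch below illustrates prompt-independent global weights and a simple co-activation filter; it only approximates the idea behind TWERA/ERA (which additionally reweight by observed residual-stream attributions), and all shapes, thresholds, and activation statistics are assumed.

```python
import numpy as np

d_model, n_src, n_tgt, n_prompts = 16, 6, 5, 200
rng = np.random.default_rng(4)

W_dec = rng.normal(size=(n_src, d_model))   # decoder vectors of source features
W_enc = rng.normal(size=(n_tgt, d_model))   # encoder vectors of target features

# Global weight: dot product of source decoder and target encoder vectors,
# i.e. the prompt-independent linear path from feature i to feature j.
global_W = W_dec @ W_enc.T                  # shape (n_src, n_tgt)

# Co-activation filter: keep only pairs that actually fire together,
# suppressing interference between features that never co-occur.
acts_src = rng.random(size=(n_prompts, n_src)) > 0.7   # assumed binary activations
acts_tgt = rng.random(size=(n_prompts, n_tgt)) > 0.7
coact = (acts_src.astype(float).T @ acts_tgt.astype(float)) / n_prompts
filtered_W = np.where(coact > 0.05, global_W, 0.0)
print(np.count_nonzero(filtered_W), "of", global_W.size, "pairs kept")
```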
Evaluation
Measure graph simplicity (path length), completeness, and mechanistic fidelity (perturbation agreement)
- Path Length
- Measure how much the paths from "embedding → logit" rely on short paths by examining the cumulative influence carried by $A^k$ as a function of path length $k$, and compute the average number of steps needed to reach the output (see the sketch after this list)
- Completeness
- The ratio of the summed edge contributions from "feature/embedding → logit" in the pruned graph to the total edge-contribution sum; the closer to 1, the smaller the error nodes (the unexplained portion)
- Replacement Score
- The proportion of paths from "embedding→logit" that use only feature paths without errors
- Mechanistic Fidelity
- Cosine similarity between perturbations: measures how similar the downstream activation-change vectors are when the same features are patched in the same way in the original model vs. the local replacement model
- Normalized MSE: the difference in the magnitude of the changes between the two models, revealing how errors accumulate per layer
- Automated Interpretability
- Sort Eval: show two features (their top-k activating examples) to an LLM and measure how reliably it can tell them apart, as an index of whether the feature visualizations are distinguishable
- Contrastive Eval: present to an LLM the features that differ between two contrasting prompts (which differ in a single detail) and measure the probability that it correctly identifies which prompt each feature came from, as an indicator of how well the features capture the difference
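As a simplified illustration of the graph-based metrics above (not the exact definitions), the sketch below estimates average path length from the influence carried by $A^k$ at each length $k$, and a completeness-style score from the share of logit influence attributable to error nodes; the node roles and sizes are assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
A = np.tril(np.abs(rng.normal(size=(n, n))), k=-1)
A = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-9)   # incoming edges sum to 1

emb_idx = np.arange(0, 3)        # assumed embedding nodes
err_idx = np.arange(10, 14)      # assumed error nodes
logit_idx = np.arange(n - 2, n)  # assumed logit nodes

# Path length: weight each step count k by the influence that paths of
# exactly length k (entries of A^k) carry from embeddings to logits.
max_k, weights = 15, []
Ak = np.eye(n)
for k in range(1, max_k + 1):
    Ak = Ak @ A
    weights.append(Ak[np.ix_(logit_idx, emb_idx)].sum())
weights = np.array(weights)
avg_path_length = (np.arange(1, max_k + 1) * weights).sum() / weights.sum()

# Completeness-style score: share of logit influence not coming from error nodes.
B = np.linalg.inv(np.eye(n) - A) - np.eye(n)
total = B[logit_idx].sum()
from_errors = B[np.ix_(logit_idx, err_idx)].sum()
completeness = 1.0 - from_errors / total
print(avg_path_length, completeness)
```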
Limitation
Uninterpreted attention QK circuits, reconstruction errors ("dark matter"), graph complexity, ignored inactive features
A supernode is a grouping of multiple feature nodes into a single semantic unit, used to simplify the visualization and causal-effect analysis of nodes with similar roles; supernodes are designated manually.