Attribution Graph

Creator: Seonglae Cho
Created: 2025 May 10 22:55
Edited: 2025 Nov 3 12:14
Refs

Inference time Local graph

Attribution graphs visualize causal flows between features. Because an attribution graph typically contains thousands or tens of thousands of edges, we keep only the paths that contribute most significantly to the model output (logit).

Nodes

The graph contains Embedding nodes, Feature nodes, Error nodes, and Logit nodes. Edges indicate direct contributions, calculated as source node value × (linear) weight. Edges fall into two types, residual-direct paths and attention-mediated paths, distinguishing connections made purely through residual connections from those routed through attention OV circuits. In the local graph, the original model's attention patterns (QK) are frozen so that the OV (value→output) stage can be included, which lets us track which token positions, through which features, contributed to which token predictions. This makes it possible to visualize how specific attention heads move information to particular features in a given prompt.
In contrast, the global graph only measures residual-direct paths (CLT decoder → residual → CLT encoder): since attention patterns (QK) change with every context, attention-mediated paths are excluded from the global analysis.
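As an illustration of the "source node value × (linear) weight" rule, a residual-direct edge between two features can be sketched as below. This is a minimal sketch, not the reference implementation; the function and argument names are assumptions (a source feature's decoder vector and a target feature's encoder vector, both living in residual-stream space).

```python
import numpy as np

# Sketch of a residual-direct edge (CLT decoder -> residual stream -> CLT encoder),
# assuming frozen attention patterns. Variable names are hypothetical.
def residual_direct_edge(source_activation: float,
                         source_decoder_vec: np.ndarray,
                         target_encoder_vec: np.ndarray) -> float:
    """Direct contribution of a source feature to a target feature:
    source node value x (linear) weight, where the linear weight is the
    dot product of the source feature's decoder vector with the target
    feature's encoder vector in the residual stream."""
    virtual_weight = float(source_decoder_vec @ target_encoder_vec)
    return source_activation * virtual_weight
```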

Importance Score

Let $A$ be the adjacency matrix of normalized absolute edge weights, containing the first-order (direct-edge) influences. The cumulative contribution of all causal interactions between graph nodes along paths of length $k$ is given by $A^k$. Using the Neumann series, the total influence matrix excluding self-influence is

$$B = \sum_{k=1}^{\infty} A^k = (I - A)^{-1} - I$$
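A quick check of the closed form (an added note: the geometric-series identity holds whenever the series converges, and since an attribution graph's nodes can be ordered from inputs to logits, $A$ is nilpotent and the sum is in fact finite):

$$(I - A)\sum_{k=0}^{\infty} A^k = I \quad\Longrightarrow\quad \sum_{k=1}^{\infty} A^k = (I - A)^{-1} - I$$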

Attribution Graph Pruning

  1. Create matrix $A$ by taking absolute values of the direct contributions (edges) between nodes (token embeddings, features, error nodes) and normalizing so that the incoming edges to each node sum to 1.
  2. Calculate indirect contributions as $B = (I - A)^{-1} - I$, where $B$ contains the summed influence of paths of all lengths.
  3. Calculate influence scores by taking a weighted average of the rows of $B$ that correspond to logit nodes (e.g. the final prediction tokens).
  4. Prune by removing low-importance nodes and edges, typically preserving 80-90% of the total influence while reducing the node count to roughly a tenth (see the sketch after this list).
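A minimal NumPy sketch of this node-pruning procedure. It assumes a dense matrix of signed direct contributions and a weight vector over logit nodes; the function name, argument layout, and threshold handling are illustrative, and the real pipeline additionally prunes edges in a separate pass.

```python
import numpy as np

def prune_attribution_graph(direct_edges, logit_weights, keep_influence=0.8):
    """Prune attribution-graph nodes by indirect-influence scores.

    direct_edges:  (n, n) signed direct contributions; direct_edges[i, j] is the
                   contribution of node j to node i (hypothetical layout).
    logit_weights: (n,) weights, nonzero only for logit nodes
                   (e.g. the probability of each predicted token).
    """
    # 1. Absolute values, normalized so incoming edges to each node sum to 1.
    A = np.abs(direct_edges)
    row_sums = A.sum(axis=1, keepdims=True)
    A = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

    # 2. Indirect influence over paths of all lengths:
    #    B = A + A^2 + ... = (I - A)^{-1} - I
    n = A.shape[0]
    B = np.linalg.inv(np.eye(n) - A) - np.eye(n)

    # 3. Influence score of each node = weighted average of the rows of B
    #    that correspond to logit nodes.
    w = logit_weights / logit_weights.sum()
    influence = w @ B  # (n,) influence of every node on the chosen logits

    # 4. Keep the smallest set of nodes covering `keep_influence` of the total.
    order = np.argsort(influence)[::-1]
    cumulative = np.cumsum(influence[order]) / influence.sum()
    n_keep = int(np.searchsorted(cumulative, keep_influence) + 1)
    kept_nodes = np.sort(order[:n_keep])
    return kept_nodes, influence
```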
 
 
 
Attribution graph interactive demo: Neuronpedia

Research: circuit-tracer (safety-research)

Two-hop reasoning (e.g., Dallas→Texas→Austin) genuinely routes through the intermediate hop rather than a shortcut. Multilingual prompts show language-agnostic reasoning followed by language-specific feature combination. CLTs show a better replacement-score/sparsity tradeoff than PLTs, while skip PLTs generally offer smaller gains.

Jailbreaking with Prompt Injection

 
 
