Attribution Graph

Creator: Seonglae Cho
Created: 2025 May 10 22:55
Edited: 2025 Nov 3 12:14
Refs

Inference time Local graph

Attribution graphs visualize causal flows between features. Because an attribution graph typically contains thousands or tens of thousands of edges, we keep only the paths that contribute most significantly to the model output (logit).

Nodes

The graph contains Embedding nodes, Feature nodes, Error nodes, and Logit nodes. Edges indicate direct contributions, calculated as source node value × (linear) weight. Edges fall into two types, residual-direct paths and attention-mediated paths, distinguishing connections made purely through residual connections from those routed through attention OV circuits. In the local graph, the original model's attention patterns (QK) are frozen so that the OV (value→output) stage can be included, which lets us track which token positions, through which features, contributed to which token predictions. This makes it possible to visualize how specific attention heads move information to particular features in a given prompt.
In contrast, the global graph only measures residual-direct paths (CLT decoder → residual → CLT encoder): since attention patterns (QK) change with every context, attention-mediated paths are excluded from the global analysis.
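As an illustration of the "source node value × (linear) weight" rule, a residual-direct edge between two features can be sketched as below. This is a minimal sketch, not the reference implementation; the function and argument names are assumptions (a source feature's decoder vector and a target feature's encoder vector, both living in residual-stream space).

```python
import numpy as np

# Sketch of a residual-direct edge (CLT decoder -> residual stream -> CLT encoder),
# assuming frozen attention patterns. Variable names are hypothetical.
def residual_direct_edge(source_activation: float,
                         source_decoder_vec: np.ndarray,
                         target_encoder_vec: np.ndarray) -> float:
    """Direct contribution of a source feature to a target feature:
    source node value x (linear) weight, where the linear weight is the
    dot product of the source feature's decoder vector with the target
    feature's encoder vector in the residual stream."""
    virtual_weight = float(source_decoder_vec @ target_encoder_vec)
    return source_activation * virtual_weight
```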

Importance Score

Let $A$ be the adjacency matrix of normalized absolute edge weights, containing the first-order (direct-edge) influences. The cumulative contribution of all causal interactions between graph nodes along paths of length $k$ is given by $A^k$. Using the Neumann series, the total influence matrix excluding self-influence is

$$B = \sum_{k=1}^{\infty} A^k = (I - A)^{-1} - I$$
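A quick check of the closed form (an added note: the geometric-series identity holds whenever the series converges, and since an attribution graph's nodes can be ordered from inputs to logits, $A$ is nilpotent and the sum is in fact finite):

$$(I - A)\sum_{k=0}^{\infty} A^k = I \quad\Longrightarrow\quad \sum_{k=1}^{\infty} A^k = (I - A)^{-1} - I$$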

Attribution Graph Pruning

  1. Create matrix $A$ by taking absolute values of the direct contributions (edges) between nodes (token embeddings, features, error nodes) and normalizing so that the incoming edges to each node sum to 1.
  2. Calculate indirect contributions as $B = (I - A)^{-1} - I$, where $B$ contains the summed influence of paths of all lengths.
  3. Calculate influence scores by taking a weighted average of the rows of $B$ that correspond to logit nodes (e.g. the final prediction tokens).
  4. Prune by removing low-importance nodes and edges, typically preserving 80-90% of the total influence while reducing the node count to roughly a tenth (see the sketch after this list).
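A minimal NumPy sketch of this node-pruning procedure. It assumes a dense matrix of signed direct contributions and a weight vector over logit nodes; the function name, argument layout, and threshold handling are illustrative, and the real pipeline additionally prunes edges in a separate pass.

```python
import numpy as np

def prune_attribution_graph(direct_edges, logit_weights, keep_influence=0.8):
    """Prune attribution-graph nodes by indirect-influence scores.

    direct_edges:  (n, n) signed direct contributions; direct_edges[i, j] is the
                   contribution of node j to node i (hypothetical layout).
    logit_weights: (n,) weights, nonzero only for logit nodes
                   (e.g. the probability of each predicted token).
    """
    # 1. Absolute values, normalized so incoming edges to each node sum to 1.
    A = np.abs(direct_edges)
    row_sums = A.sum(axis=1, keepdims=True)
    A = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

    # 2. Indirect influence over paths of all lengths:
    #    B = A + A^2 + ... = (I - A)^{-1} - I
    n = A.shape[0]
    B = np.linalg.inv(np.eye(n) - A) - np.eye(n)

    # 3. Influence score of each node = weighted average of the rows of B
    #    that correspond to logit nodes.
    w = logit_weights / logit_weights.sum()
    influence = w @ B  # (n,) influence of every node on the chosen logits

    # 4. Keep the smallest set of nodes covering `keep_influence` of the total.
    order = np.argsort(influence)[::-1]
    cumulative = np.cumsum(influence[order]) / influence.sum()
    n_keep = int(np.searchsorted(cumulative, keep_influence) + 1)
    kept_nodes = np.sort(order[:n_keep])
    return kept_nodes, influence
```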
 
 
 
Attribution graph interactive demo: Neuronpedia

Research: circuit-tracer (safety-research)

Two-hop reasoning (e.g., Dallas→Texas→Austin) genuinely routes through the intermediate hop rather than a shortcut. Multilingual prompts show language-agnostic reasoning followed by language-specific feature combination. CLTs show a better replacement-score/sparsity tradeoff than PLTs, while skip PLTs generally offer smaller gains.

Jailbreaking with Prompt Injection

 
 
