Neuronpedia Research
circuit-tracer
safety-research • Updated 2026 Feb 10 14:51
Two-hop reasoning (e.g., Dallas→Texas→Austin) actually uses intermediate hops rather than mapping the input directly to the answer. Multilingual prompts show language-agnostic reasoning followed by language-specific feature combination. Cross-layer transcoders (CLT) show a better replacement score/sparsity tradeoff than per-layer transcoders (PLT), while skip PLTs generally offer fewer benefits.
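The structural difference behind the CLT vs. PLT comparison can be sketched with a toy example. This is a minimal illustration, not the circuit-tracer implementation: the sizes, random weights, and the function names `plt_reconstruct` and `clt_reconstruct` are all hypothetical. It assumes the usual description that a per-layer transcoder (PLT) writes only to the MLP output of its own layer, whereas a cross-layer transcoder (CLT) feature read at layer `src` can write to the MLP outputs of every layer `tgt >= src`.

```python
import numpy as np

# Toy sizes and random weights; every value here is a hypothetical stand-in.
rng = np.random.default_rng(0)
n_layers, d_model, n_feat = 3, 8, 16

W_enc = rng.normal(size=(n_layers, n_feat, d_model)) * 0.1
# PLT: one decoder per layer (features at layer l write only to layer l).
W_dec_plt = rng.normal(size=(n_layers, d_model, n_feat)) * 0.1
# CLT: one decoder per (source layer, target layer) pair, used when src <= tgt.
W_dec_clt = rng.normal(size=(n_layers, n_layers, d_model, n_feat)) * 0.1


def relu(x):
    return np.maximum(x, 0.0)


def encode(resid):
    """Feature activations per layer from the residual stream, shape (n_layers, n_feat)."""
    return relu(np.einsum("lfd,ld->lf", W_enc, resid))


def plt_reconstruct(resid):
    """Per-layer transcoder: layer l's MLP output uses layer l's features only."""
    acts = encode(resid)
    return np.einsum("ldf,lf->ld", W_dec_plt, acts), acts


def clt_reconstruct(resid):
    """Cross-layer transcoder: layer tgt's MLP output sums over features from all src <= tgt."""
    acts = encode(resid)
    out = np.zeros((n_layers, d_model))
    for tgt in range(n_layers):
        for src in range(tgt + 1):
            out[tgt] += W_dec_clt[src, tgt] @ acts[src]
    return out, acts


# Usage: reconstruct MLP outputs for one token position and count active features (L0).
resid = np.random.default_rng(1).normal(size=(n_layers, d_model))
plt_out, plt_acts = plt_reconstruct(resid)
clt_out, clt_acts = clt_reconstruct(resid)
print("active features per layer:", (clt_acts > 0).sum(axis=1))
```

In this sketch, changing the layer-0 residual changes the CLT reconstruction at layer 2 but leaves the PLT reconstruction at layer 2 untouched, which is the cross-layer write path that distinguishes the two designs.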
The Circuits Research Landscape: Results and Perspectives - August 2025
A multi-organization interpretability project to replicate and extend circuit tracing research.
https://www.neuronpedia.org/graph/info

Attribution Graphs for Dummies - 1. What are Attribution Graphs?
Part 2: https://youtu.be/hdi1a9MjwDs
An introduction to attribution graphs from Anthropic's Circuit Tracing and Model Biology papers, featuring Jack Lindsey (Anthropic), Emmanuel Ameisen (Anthropic), Tom McGrath (Goodfire AI), and Neel Nanda (Google DeepMind).
0:00 Introduction
2:18 Attribution Graph Orientation
19:10 Analyzing an Attribution Graph from Scratch
40:25 Reflection: What have we Learned?
Explore Attribution Graphs: https://neuronpedia.org/graph
Blog Post: https://www.neuronpedia.org/graph/info
circuit-tracer GitHub: https://github.com/safety-research/circuit-tracer
Original Papers by Anthropic
- Circuit Tracing: https://transformer-circuits.pub/2025/attribution-graphs/methods.html
- Biology of an LLM: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
https://www.youtube.com/watch?v=ruLcDtr_cGo


Seonglae Cho