Given a task and a model, mechanistic interpretability (MI) aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task.
Causal abstraction is a theoretical framework that examines whether a homomorphism exists from low-level variables (neurons, attention heads) to high-level conceptual variables.
Causal abstractions
Deterministic Causal Model
A (deterministic) causal model with $N$ components is a quadruple $\mathcal{M} = \langle x, \mathcal{H}, \mathcal{F}, \prec \rangle$, where $\mathcal{H} = \{h_1, \dots, h_N\}$ is a set of hidden variables, $x$ is an input variable, and $\prec$ defines a partial ordering over $\mathcal{H}$.
Hidden variables are intermediate computation results (activations) that capture the model's state and form the nodes of the computational graph; the functions, connected according to the partial order, act as edges, so the computational graph is a DAG.
For each $h_n \in \mathcal{H}$, $h_n = f_n(\mathrm{pa}_n)$, where $f_n \in \mathcal{F}$ is a deterministic function and $\mathrm{pa}_n \subseteq \{x\} \cup \{h_m : h_m \prec h_n\}$ are the parents of $h_n$.
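As a minimal sketch of this definition (the toy arithmetic task, variable names, and `run` helper are my own illustration, not from the source), the model below has input $x = (a, b, c)$, hidden variables $h_1 = a + b$ and $h_2 = h_1 \cdot c$, and partial order $h_1 \prec h_2$; an intervention simply overwrites a hidden variable before its children are computed.

```python
# Toy deterministic causal model (illustrative only):
# input x = (a, b, c), hidden variables h1 = a + b and h2 = h1 * c, order h1 ≺ h2.

def f1(x):                      # parents of h1: the input x only
    a, b, c = x
    return a + b

def f2(x, h1):                  # parents of h2: the input x and h1
    a, b, c = x
    return h1 * c

def run(x, interventions=None):
    """Evaluate hidden variables in topological order; an intervention
    fixes a variable's value and lets it propagate to its children."""
    interventions = interventions or {}
    h1 = interventions.get("h1", f1(x))
    h2 = interventions.get("h2", f2(x, h1))
    return {"h1": h1, "h2": h2}

print(run((2, 3, 4)))                 # {'h1': 5, 'h2': 20}
print(run((2, 3, 4), {"h1": 10}))     # do(h1 = 10) -> {'h1': 10, 'h2': 40}
```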
Non‑linear Representation Dilemma
Causal abstraction uses an alignment map to connect a model's hidden states with the intermediate variables of an algorithm, but the definition itself does not restrict this map to be linear. The paper shows that if the map is made sufficiently expressive, almost any model can be aligned with almost any algorithm while still passing intervention-consistency checks (demonstrated through both existence constructions and learning experiments).
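A hedged sketch of what the alignment map does (the toy low-level model, the rotation featurizer, and all names are illustrative assumptions, not the paper's setup): an interchange intervention patches the feature that the alignment identifies with a high-level variable using its value from a counterfactual "source" input, and consistency means the low-level output matches the high-level algorithm's output under the same intervention. Here the featurizer is a simple orthogonal rotation; the dilemma arises because nothing in the definition stops it from being an arbitrary non-linear map, and with enough expressive power such a map can make nearly any model pass this check for nearly any algorithm.

```python
import numpy as np

# Low-level "model" (illustrative): hidden state z ∈ R^2 after step 1, output after step 2.
def low_hidden(x):              # x = (a, b, c); z stores a and b separately
    a, b, c = x
    return np.array([a, b], dtype=float)

def low_output(z, x):           # the second step only reads c from the input
    _, _, c = x
    return (z[0] + z[1]) * c

# High-level algorithm: one intermediate variable H1 = a + b, output H1 * c.
def high_h1(x):
    a, b, _ = x
    return a + b

def high_output(h1, x):
    return h1 * x[2]

# Alignment: an invertible featurizer of z whose first coordinate is taken to
# "be" H1.  Here it is linear (a 45° rotation); the non-linear representation
# dilemma concerns allowing this map to be an arbitrary non-linear bijection.
R = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2)          # orthogonal: R.T @ R = I

def interchange(base, source):
    """Interchange intervention: patch the aligned feature of the base hidden
    state with its value from the source run, then compare the low-level
    output against the high-level counterfactual prediction."""
    zb, zs = low_hidden(base), low_hidden(source)
    fb, fs = R @ zb, R @ zs
    fb[0] = fs[0]                                  # swap only the aligned coordinate
    z_patched = R.T @ fb                           # map back to the hidden space
    low = low_output(z_patched, base)
    high = high_output(high_h1(source), base)      # same intervention on the algorithm
    return low, high

print(interchange(base=(2, 3, 4), source=(10, 20, 1)))   # ~ (120.0, 120): consistent
```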
Stanford lecture video: mechanistic interpretability that aims for causal grounding.
Causal Mechanistic Interpretability (Stanford lecture 1) - Atticus Geiger
How can we use the language of causality to understand and edit the internal mechanisms of AI models?
Atticus Geiger (Goodfire) gives a guest lecture on applying frameworks and tools from causal modeling to understand LLMs and other neural networks in Surya Ganguli's Stanford course APPPHYS 293.
00:00 - Intro
01:51 - Activation steering (e.g. Golden Gate Claude)
10:23 - Causal mediation analysis (understanding the contribution of an intermediate component)
21:42 - Causal abstraction methods (explaining a complex causal system with a simple one)
26:11 - Interchange interventions
40:46 - Distributed Alignment Search (a code sketch follows the links below)
54:54 - Lookback mechanisms: a case study in designing counterfactuals
Read more about our research: https://www.goodfire.ai/research
Follow us on X: https://x.com/GoodfireAI
https://www.youtube.com/watch?v=78Xa8VkH7-g&pp=0gcJCU0KAYcqIYzv
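For the Distributed Alignment Search segment above (40:46), here is a minimal sketch of the core trick under assumed names and shapes (`model_head`, `counterfactual_target`, `HIDDEN`, and `K` are placeholders, not from the lecture): an orthogonal rotation of a hidden layer is learned so that swapping a fixed block of rotated coordinates between a base run and a source run reproduces the counterfactual output predicted by the high-level algorithm.

```python
import torch
import torch.nn as nn

# Distributed Alignment Search, schematically: learn an orthogonal rotation R so
# that patching the first K rotated coordinates of a hidden state with their
# values from a source run yields the behavior the high-level algorithm predicts
# for the corresponding interchange intervention.  Only R is trained; the model
# itself stays frozen (here it is abstracted away as `model_head`).

HIDDEN, K = 16, 4   # hidden size and size of the aligned subspace (assumed values)

rotation = nn.utils.parametrizations.orthogonal(
    nn.Linear(HIDDEN, HIDDEN, bias=False)
)

def das_patch(h_base, h_source):
    """Rotate both hidden states, swap the first K coordinates, rotate back."""
    R = rotation.weight                            # orthogonal (HIDDEN, HIDDEN) matrix
    rb, rs = h_base @ R.T, h_source @ R.T
    patched = torch.cat([rs[..., :K], rb[..., K:]], dim=-1)
    return patched @ R

def das_loss(model_head, h_base, h_source, counterfactual_target):
    """Train R so the patched run matches the high-level counterfactual output."""
    pred = model_head(das_patch(h_base, h_source))
    return nn.functional.mse_loss(pred, counterfactual_target)

# optimizer = torch.optim.Adam(rotation.parameters(), lr=1e-3)   # only R is optimized
```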


Seonglae Cho