Given a task and a model, mechanistic interpretability (MI) aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task.
Causal abstraction is a theoretical framework that examines whether a homomorphism exists from low-level variables (neurons, attention heads) to high-level conceptual variables.
Causal abstractions
Deterministic Causal Model
A (deterministic) causal model with $N$ components is a quadruple $\mathcal{M} = \langle x, \mathcal{H}, \mathcal{F}, \prec \rangle$, where $\mathcal{H} = \{h_1, \dots, h_N\}$ is a set of hidden variables, $x$ is an input variable, and $\prec$ defines a partial ordering over $\mathcal{H}$.
Hidden variables are intermediate computation results (activations) that capture the model's state and form the nodes of the computational graph; the functions, connected according to the partial order, act as edges, so the computational graph is a DAG.
For each $h_n \in \mathcal{H}$, $h_n = f_n(\mathrm{pa}_n)$, where $f_n \in \mathcal{F}$ is a deterministic function and $\mathrm{pa}_n \subseteq \{x\} \cup \{h_m : h_m \prec h_n\}$ are the parents of $h_n$.
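As a minimal sketch of this definition (the toy arithmetic task, variable names, and `run` helper are my own illustration, not from the source), the model below has input $x = (a, b, c)$, hidden variables $h_1 = a + b$ and $h_2 = h_1 \cdot c$, and partial order $h_1 \prec h_2$; an intervention simply overwrites a hidden variable before its children are computed.

```python
# Toy deterministic causal model (illustrative only):
# input x = (a, b, c), hidden variables h1 = a + b and h2 = h1 * c, order h1 ≺ h2.

def f1(x):                      # parents of h1: the input x only
    a, b, c = x
    return a + b

def f2(x, h1):                  # parents of h2: the input x and h1
    a, b, c = x
    return h1 * c

def run(x, interventions=None):
    """Evaluate hidden variables in topological order; an intervention
    fixes a variable's value and lets it propagate to its children."""
    interventions = interventions or {}
    h1 = interventions.get("h1", f1(x))
    h2 = interventions.get("h2", f2(x, h1))
    return {"h1": h1, "h2": h2}

print(run((2, 3, 4)))                 # {'h1': 5, 'h2': 20}
print(run((2, 3, 4), {"h1": 10}))     # do(h1 = 10) -> {'h1': 10, 'h2': 40}
```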
Non‑linear Representation Dilemma
Causal abstraction uses an alignment map to connect a model's hidden states with the intermediate variables of an algorithm, but the definition itself does not restrict this map to be linear. The paper shows that if the map is made sufficiently expressive, almost any model can be aligned with almost any algorithm while still passing intervention-consistency checks (demonstrated through both existence constructions and learning experiments).
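A hedged sketch of what the alignment map does (the toy low-level model, the rotation featurizer, and all names are illustrative assumptions, not the paper's setup): an interchange intervention patches the feature that the alignment identifies with a high-level variable using its value from a counterfactual "source" input, and consistency means the low-level output matches the high-level algorithm's output under the same intervention. Here the featurizer is a simple orthogonal rotation; the dilemma arises because nothing in the definition stops it from being an arbitrary non-linear map, and with enough expressive power such a map can make nearly any model pass this check for nearly any algorithm.

```python
import numpy as np

# Low-level "model" (illustrative): hidden state z ∈ R^2 after step 1, output after step 2.
def low_hidden(x):              # x = (a, b, c); z stores a and b separately
    a, b, c = x
    return np.array([a, b], dtype=float)

def low_output(z, x):           # the second step only reads c from the input
    _, _, c = x
    return (z[0] + z[1]) * c

# High-level algorithm: one intermediate variable H1 = a + b, output H1 * c.
def high_h1(x):
    a, b, _ = x
    return a + b

def high_output(h1, x):
    return h1 * x[2]

# Alignment: an invertible featurizer of z whose first coordinate is taken to
# "be" H1.  Here it is linear (a 45° rotation); the non-linear representation
# dilemma concerns allowing this map to be an arbitrary non-linear bijection.
R = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2)          # orthogonal: R.T @ R = I

def interchange(base, source):
    """Interchange intervention: patch the aligned feature of the base hidden
    state with its value from the source run, then compare the low-level
    output against the high-level counterfactual prediction."""
    zb, zs = low_hidden(base), low_hidden(source)
    fb, fs = R @ zb, R @ zs
    fb[0] = fs[0]                                  # swap only the aligned coordinate
    z_patched = R.T @ fb                           # map back to the hidden space
    low = low_output(z_patched, base)
    high = high_output(high_h1(source), base)      # same intervention on the algorithm
    return low, high

print(interchange(base=(2, 3, 4), source=(10, 20, 1)))   # ~ (120.0, 120): consistent
```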
Stanford lecture video: mechanistic interpretability that aims for causal grounding.
Causal Mechanistic Interpretability (Stanford lecture 1) - Atticus Geiger
How can we use the language of causality to understand and edit the internal mechanisms of AI models?
Atticus Geiger (Goodfire) gives a guest lecture on applying frameworks and tools from causal modeling to understand LLMs and other neural networks in Surya Ganguli's Stanford course APPPHYS 293.
00:00 - Intro
01:51 - Activation steering (e.g. Golden Gate Claude)
10:23 - Causal mediation analysis (understanding the contribution of an intermediate component)
21:42 - Causal abstraction methods (explaining a complex causal system with a simple one)
26:11 - Interchange interventions
40:46 - Distributed Alignment Search (a code sketch follows the links below)
54:54 - Lookback mechanisms: a case study in designing counterfactuals
Read more about our research: https://www.goodfire.ai/research
Follow us on X: https://x.com/GoodfireAI
https://www.youtube.com/watch?v=78Xa8VkH7-g&pp=0gcJCU0KAYcqIYzv
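For the Distributed Alignment Search segment above (40:46), here is a minimal sketch of the core trick under assumed names and shapes (`model_head`, `counterfactual_target`, `HIDDEN`, and `K` are placeholders, not from the lecture): an orthogonal rotation of a hidden layer is learned so that swapping a fixed block of rotated coordinates between a base run and a source run reproduces the counterfactual output predicted by the high-level algorithm.

```python
import torch
import torch.nn as nn

# Distributed Alignment Search, schematically: learn an orthogonal rotation R so
# that patching the first K rotated coordinates of a hidden state with their
# values from a source run yields the behavior the high-level algorithm predicts
# for the corresponding interchange intervention.  Only R is trained; the model
# itself stays frozen (here it is abstracted away as `model_head`).

HIDDEN, K = 16, 4   # hidden size and size of the aligned subspace (assumed values)

rotation = nn.utils.parametrizations.orthogonal(
    nn.Linear(HIDDEN, HIDDEN, bias=False)
)

def das_patch(h_base, h_source):
    """Rotate both hidden states, swap the first K coordinates, rotate back."""
    R = rotation.weight                            # orthogonal (HIDDEN, HIDDEN) matrix
    rb, rs = h_base @ R.T, h_source @ R.T
    patched = torch.cat([rs[..., :K], rb[..., K:]], dim=-1)
    return patched @ R

def das_loss(model_head, h_base, h_source, counterfactual_target):
    """Train R so the patched run matches the high-level counterfactual output."""
    pred = model_head(das_patch(h_base, h_source))
    return nn.functional.mse_loss(pred, counterfactual_target)

# optimizer = torch.optim.Adam(rotation.parameters(), lr=1e-3)   # only R is optimized
```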


Seonglae Cho