Pragmatic Interpretability

Creator: Seonglae Cho
Created: 2025 Dec 2 1:40
Edited: 2026 Jan 9 16:11

Ambitious Interpretability (Theoretical Mechanistic Interpretability)

We want to understand the model: decompose the activations and components of the neural network and run causal analysis to understand them completely. This often takes a theoretical, philosophical, and mathematical approach.

Pragmatic Interpretability

We want to understand the model so that we can make models safer using interpretability techniques. An experiment-based, engineering-oriented reductionist approach.

Constructive Interpretability

We want to improve the model based on our interpretability understanding: we know which parts are problematic and which parts contribute to intelligence. How can we leverage this information and change the model's structure to achieve AGI or build better models?

Pragmatic Interpretability

The traditional "complete reverse engineering" approach has made very slow progress. Instead of reverse engineering the entire structure, the field is shifting toward pragmatic interpretability that directly solves real-world safety problems.
Without feedback loops, self-deception becomes easy, so proxy tasks (measurable surrogate tasks) are essential. Even in SAE research, metrics like reconstruction error turned out to be nearly meaningless on their own. Instead, testing performance on proxies like OOD generalization, unlearning, and hidden-goal extraction revealed the real limitations clearly.
This is where the criticism of SAEs reappears: the means often become the end. It's easy to stop at "we saw something with an SAE." Be wary of using an SAE when simpler methods would work. Does this actually help us understand the model better, or did we just extract a lot of features?
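A minimal sketch of the gap between the two kinds of metrics, assuming a toy PyTorch SAE. The dimensions, the random activation batch, and the `evaluate_proxy_task` placeholder are illustrative assumptions, not any particular library's API:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE over residual-stream activations (illustrative sizes)."""
    def __init__(self, d_model: int = 768, d_sae: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(z), z        # reconstruction, features

sae = SparseAutoencoder()
x = torch.randn(32, 768)                 # stand-in batch of activations
x_hat, z = sae(x)

# The metric that is easy to optimize yet nearly meaningless on its own:
reconstruction_error = ((x - x_hat) ** 2).mean()
sparsity = z.abs().mean()                # L1 term used during training

# What a proxy task asks instead: substitute the reconstruction back into
# the model and check whether behavior survives on OOD generalization,
# unlearning, or hidden-goal extraction. `evaluate_proxy_task` is a
# hypothetical placeholder for that task-specific harness.
# proxy_score = evaluate_proxy_task(model, patch_with=x_hat)
```

Low reconstruction error says only that the SAE copies activations well; the proxy score is what ties the features to an actual safety-relevant capability.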

Ambitious Mechanistic Interpretability (from Leo Gao)

Short-term, outcome-focused pragmatic interpretability risks optimizing for superficial signals, making it hard to understand why failures occur in the long run and leaving systems fragile. The strength of AMI lies in achieving debugger-level internal understanding that cleanly distinguishes between hypotheses and yields knowledge that may generalize even to radically different future AGI systems. Recent work has identified much simpler and more interpretable circuits than in the past (e.g., for IOI) by leveraging circuit sparsity.
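As a concrete picture of what "debugger-level" causal analysis looks like, here is a hedged activation-patching sketch in PyTorch. The `model.blocks` attribute and the assumption that each block's forward returns a plain tensor are illustrative and architecture-dependent; circuit work on tasks like IOI compares the patched and clean runs on a task metric such as the logit difference between candidate name tokens:

```python
import torch

def activation_patch(model, clean_tokens, corrupt_tokens, layer: int):
    """Denoising-style patch: cache one layer's activation on the clean
    prompt, then replay it while running the corrupted prompt. How much
    clean behavior is restored indicates the layer's causal importance.
    Assumes model.blocks is an indexable list of modules whose forward
    returns a plain tensor (hypothetical; varies by architecture)."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["clean"]  # replace corrupted activation with the clean one

    handle = model.blocks[layer].register_forward_hook(save_hook)
    with torch.no_grad():
        clean_logits = model(clean_tokens)      # clean run fills the cache
    handle.remove()

    handle = model.blocks[layer].register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupt_tokens)  # corrupted run, clean patch
    handle.remove()

    # Compare on the task metric, e.g. the IOI logit difference between
    # the correct name token and the distractor name token.
    return clean_logits, patched_logits
```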

Recommendations