Mechanistic interpretability Academic

Ambitious interpretability (Theoretical Mechanistic Interpretability)

We want to understand the model. The aim is to decompose the activations and components of a neural network and apply causal analysis until they are completely understood. This line of work often takes a theoretical, philosophical, and mathematical approach.
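As a rough illustration of what "decompose and do causal analysis" can mean in practice, the sketch below zero-ablates each hidden unit of a tiny hand-wired ReLU network and measures how much the output changes. The network, its weights, and the helper names (`forward`, `ablation_effect`) are assumptions made up for this example, not taken from any particular paper or library, and zero-ablation is only one of several possible interventions.

```python
def forward(x, W1, b1, W2, ablate=None):
    """Run a one-hidden-layer ReLU network; optionally zero one hidden unit."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    hidden = [max(0.0, p) for p in pre]
    if ablate is not None:
        hidden[ablate] = 0.0  # causal intervention: knock out this unit
    output = sum(w * h for w, h in zip(W2, hidden))
    return hidden, output


def ablation_effect(unit, inputs, W1, b1, W2):
    """Mean absolute change in the output when `unit` is zero-ablated."""
    total = 0.0
    for x in inputs:
        _, clean = forward(x, W1, b1, W2)
        _, ablated = forward(x, W1, b1, W2, ablate=unit)
        total += abs(clean - ablated)
    return total / len(inputs)


# Hypothetical toy weights: units 0 and 1 together compute XOR,
# unit 2 is wired to the output with weight 0 and should not matter.
W1 = [[1.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
b1 = [0.0, -1.0, 0.0]
W2 = [1.0, -2.0, 0.0]

inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
effects = [ablation_effect(u, inputs, W1, b1, W2) for u in range(len(W2))]

for unit, effect in enumerate(effects):
    print(f"hidden unit {unit}: mean |change in output| = {effect}")
```

In this toy case the ablation scores separate the two units that carry the computation from the one that does not. Real models need more careful interventions, such as patching in activations from a different input rather than zeros.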

Pragmatic Interpretability

We want to understand the model well enough to answer a practical question: how can interpretability techniques make models safer? This is an experiment-driven, engineering-oriented, reductionist approach.

Constructive Interpretability

We want to improve the model based on what interpretability tells us about it. If we know which parts are problematic and which parts contribute to intelligence, how can we use that information to change the structure of the model and achieve AGI or better models?

Recommendations