Interpretability
The degree to which a model can be understood in human terms
Model inspection only provides information about the model itself; the model might not accurately reflect the underlying data.
Interpretability paradigms offer distinct lenses for understanding neural networks: behavioral analysis studies input-output relations; attributional methods quantify how much individual input features influence a prediction; concept-based methods identify the high-level representations that govern behavior; mechanistic interpretability uncovers the precise causal mechanisms leading from inputs to outputs.
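As a minimal sketch of the attributional paradigm, the snippet below estimates per-feature influence with the input-times-gradient heuristic. The toy model and random input are illustrative placeholders, not from any of the linked sources.

```python
import torch
import torch.nn as nn

# Toy setup: a small MLP and a single 4-feature input (placeholders).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(1, 4, requires_grad=True)

# Attribute the logit of class 0 back to the input features.
logits = model(x)
logits[0, 0].backward()

# Input-times-gradient: a first-order estimate of each feature's influence.
attribution = (x * x.grad).detach()
print(attribution)  # one score per input feature
```

More faithful attribution methods, such as integrated gradients, average this quantity along a path from a baseline input to reduce saturation artifacts.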
Interpretable AI Notion
Explainable AI Methods
Challenges
200 Concrete Open Problems in Mechanistic Interpretability: Introduction — AI Alignment Forum
EDIT 19/7/24: This sequence is now two years old, and fairly out of date. I hope it's still useful for historical reasons, but I no longer recommend…
https://www.alignmentforum.org/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team — LessWrong
Why we made this list: The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we’d work on…
https://www.lesswrong.com/posts/KfkpgXdgRheSRWDy8/a-list-of-45-mech-interp-project-ideas-from-apollo-research

Dream
Interpretability Dreams
Before diving in, it's worth making a few small remarks. Firstly, essentially all the ideas in this essay were previously articulated, but buried in previous papers. Our goal is just to surface those implicit visions, largely by quoting relevant parts. Secondly, it's important to note that everything in this essay is almost definitionally extremely speculative and uncertain. It's far from clear that any of it will ultimately be possible. Finally, since the goal of this essay is to lay out our personal vision of what's inspiring to us, it may come across as a bit grandiose – we hope that it can be understood as simply trying to communicate subjective excitement in an open way.
https://transformer-circuits.pub/2023/interpretability-dreams/index.html
Dario Amodei — The Urgency of Interpretability
In the decade that I have been working on AI, I’ve watched it grow from a tiny academic field to arguably the most important economic and geopolitical issue in the world. In all that time, perhaps the most important lesson I’ve learned is this: the progress of the underlying technology is inexorable, driven by forces too powerful to stop, but the way in which it happens—the order in which things are built, the applications we choose, and the details of how it is rolled out to society—are eminently possible to change, and it’s possible to have great positive impact by doing so. We can’t stop the bus, but we can steer it. In the past I’ve written about the importance of deploying AI in a way that is positive for the world, and of ensuring that democracies build and wield the technology before autocracies do. Over the last few months, I have become increasingly focused on an additional opportunity for steering the bus: the tantalizing possibility, opened up by some recent advances, that we could succeed at interpretability—that is, in understanding the inner workings of AI systems—before models reach an overwhelming level of power.
https://www.darioamodei.com/post/the-urgency-of-interpretability


Seonglae Cho