Interpretable AI

Creator: Seonglae Cho
Created: 2024 May 1 1:17
Edited: 2026 Jan 9 16:11

Interpretability

Degree to which a model can be understood in human terms
Model inspection only provides information about the model itself; the model may not faithfully reflect the underlying data.
Interpretability paradigms offer distinct lenses for understanding neural networks:
Behavioral: analyzes input-output relations
Attributional: quantifies the influence of individual input features (see the sketch below)
Concept-based: identifies high-level representations governing behavior
Mechanistic: uncovers precise causal mechanisms from inputs to outputs
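Of these, the attributional paradigm is the easiest to make concrete in code. Below is a minimal, hypothetical sketch (a toy PyTorch model and random input, not anything from this page) of gradient × input saliency, one common attribution method: each input feature's score is its gradient with respect to the model's top-class logit, scaled by the feature's value.
```python
# Minimal sketch of the attributional paradigm: gradient x input saliency.
# The model and input are hypothetical stand-ins for illustration only.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(1, 4, requires_grad=True)  # toy input with 4 features

logits = model(x)
target = logits[0, logits.argmax()]  # attribute the top predicted class
target.backward()

# Each feature's attribution: its gradient scaled by its own value.
attributions = (x.grad * x).detach().squeeze()
print(attributions)  # one influence score per input feature
```
Note that such scores explain only this model's sensitivity at this input; they are local, and (per the caveat above) say nothing about whether the model reflects the data.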
Interpretable AI Notion
Explainable AI Methods

Dream

Interpretability Dreams
Before diving in, it's worth making a few small remarks. Firstly, essentially all the ideas in this essay were previously articulated, but buried in previous papers. Our goal is just to surface those implicit visions, largely by quoting relevant parts. Secondly, it's important to note that everything in this essay is almost definitionally extremely speculative and uncertain. It's far from clear that any of it will ultimately be possible. Finally, since the goal of this essay is to lay out our personal vision of what's inspiring to us, it may come across as a bit grandiose – we hope that it can be understood as simply trying to communicate subjective excitement in an open way.
Dario Amodei — The Urgency of Interpretability
In the decade that I have been working on AI, I’ve watched it grow from a tiny academic field to arguably the most important economic and geopolitical issue in the world. In all that time, perhaps the most important lesson I’ve learned is this: the progress of the underlying technology is inexorable, driven by forces too powerful to stop, but the way in which it happens—the order in which things are built, the applications we choose, and the details of how it is rolled out to society—are eminently possible to change, and it’s possible to have great positive impact by doing so. We can’t stop the bus, but we can steer it. In the past I’ve written about the importance of deploying AI in a way that is positive for the world, and of ensuring that democracies build and wield the technology before autocracies do. Over the last few months, I have become increasingly focused on an additional opportunity for steering the bus: the tantalizing possibility, opened up by some recent advances, that we could succeed at interpretability—that is, in understanding the inner workings of AI systems—before models reach an overwhelming level of power.

Recommendations