Mechanistic Interpretability - NEEL NANDA (DeepMind)
http://80000hours.org/mlst
Visit our sponsor 80000 hours - grab their free career guide and check out their podcast! Use our special link above!
Support us! https://www.patreon.com/mlst
MLST Discord: https://discord.gg/aNPkGUQtc5
Twitter: https://twitter.com/MLStreetTalk
In this wide-ranging conversation, Tim Scarfe interviews Neel Nanda, a researcher at DeepMind working on mechanistic interpretability, which aims to understand the algorithms and representations learned by machine learning models. Neel discusses how models can represent their thoughts using motifs, circuits, and features encoded as linear directions, often communicated via the "residual stream", an information highway that models use to pass information between layers.
Neel argues that "superposition", the ability of models to represent more features than they have neurons, is one of the biggest open problems in interpretability, because it thwarts our ability to understand models by decomposing them into individual units of analysis. Despite this, Neel remains optimistic that ambitious interpretability is possible, citing examples like his work reverse engineering how small transformers do modular addition.
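To make "more features than neurons" concrete, here is a minimal standalone numpy sketch (illustrative only, not code from the episode; the sizes and variable names are made up): many almost-orthogonal directions can share a lower-dimensional space, and a sparse set of active features can still be read back out with limited interference.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 1000, 256                  # ~4x more "features" than dimensions
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # one unit-norm direction per feature

# Interference = overlap between different feature directions (off-diagonal cosines).
cos = W @ W.T
np.fill_diagonal(cos, 0)
print("max interference between features:", np.abs(cos).max())

# Encode a sparse set of active features, then read one back out via dot products.
active = [3, 57, 200]                           # indices of the "on" features
x = W[active].sum(axis=0)                       # superposed representation in n_dims
readout = W @ x                                 # score for every feature direction
print("score for active feature 3:", readout[3])                                           # close to 1
print("largest score among inactive features:", np.abs(np.delete(readout, active)).max())  # clearly smaller

Because the directions are not orthogonal the readout is noisy, but as long as only a few features are active at once the noise stays manageable; that trade-off is the superposition hypothesis in miniature.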
Key areas of discussion:
* Mechanistic interpretability aims to reverse engineer and understand the inner workings of AI systems like neural networks. It could help ensure safety and alignment.
* Neural networks seem to learn actual algorithms and processes for tasks, not just statistical correlations. This suggests interpretability may be possible.
* 'Grokking' refers to the phenomenon where neural networks suddenly generalize after initially memorizing. Understanding this transition required probing the underlying mechanisms.
* The 'superposition hypothesis' suggests neural networks represent more features than they have neurons by using non-orthogonal vectors. This poses challenges for interpretability.
* Transformers appear to implement algorithms using attention heads and other building blocks. Understanding this could enable interpreting their reasoning.
* Specific circuits like 'induction heads' seem to underlie capabilities like few-shot learning. Finding such circuits helps explain emergent phenomena (a conceptual sketch follows after this list).
* Causal interventions can isolate model circuits. Techniques like 'activation patching' substitute activations from one run into another to test which components are necessary and sufficient (see the PyTorch sketch after this list).
* We likely can't precisely control AI system goals now. Interpretability may reveal if systems have meaningful goal-directedness.
* Near-term risks like misuse seem more pressing than far-future risks like recursive self-improvement. But building better understanding now enables safety work later.
* Neel thinks we shouldn't "over-philosophize". The key issue is whether AI could pose catastrophic risk, not whether it fits abstract definitions.
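As referenced above, here is a conceptual sketch of what an induction head computes (plain Python, no model involved; a deliberately simplified description of the circuit's behaviour, not of how it is implemented inside a transformer): if the current token A appeared earlier in the context followed by B, predict B again.

# Induction pattern: [A][B] ... [A] -> predict [B]
def induction_prediction(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):    # scan backwards for an earlier copy of the current token
        if tokens[i] == current:
            return tokens[i + 1]                # copy whatever followed it last time
    return None                                 # no earlier occurrence, no prediction

# Example: induction_prediction(["the", "cat", "sat", "the"]) returns "cat"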
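And for "activation patching", here is a hedged sketch in plain PyTorch (the model.blocks[layer] structure and the clean/corrupted prompt setup are assumptions for illustration, not the API of any particular interpretability library; it also assumes each block returns a plain tensor): cache an activation from a clean run, substitute it into a corrupted run, and see how much of the clean behaviour comes back.

import torch

def get_activation(model, layer, tokens):
    # Run the model once and cache the output of one block.
    cache = {}
    def hook(module, inputs, output):
        cache["act"] = output.detach()
    handle = model.blocks[layer].register_forward_hook(hook)   # assumes model.blocks[i] are nn.Modules
    with torch.no_grad():
        model(tokens)
    handle.remove()
    return cache["act"]

def run_with_patch(model, layer, tokens, patched_act):
    # Run the model again, but substitute the cached activation at that block.
    def hook(module, inputs, output):
        return patched_act        # returning a value from a forward hook overrides the block's output
    handle = model.blocks[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(tokens)
    handle.remove()
    return logits

# Usage idea: cache an activation from a "clean" prompt, patch it into a run on a
# "corrupted" prompt, and check how much of the clean prediction is restored:
#   clean_act = get_activation(model, layer=5, tokens=clean_tokens)
#   patched_logits = run_with_patch(model, layer=5, tokens=corrupted_tokens, patched_act=clean_act)

If patching one component's activation restores the behaviour, that is evidence the component is sufficient for it; if corrupting it destroys the behaviour, that is evidence it is necessary.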
What do YOU think? Let us know in the comments!
Neel Nanda: https://www.neelnanda.io/
https://www.youtube.com/channel/UCBMJ0D-omcRay8dh4QT0doQ
Pod version: https://podcasters.spotify.com/pod/show/machinelearningstreettalk/episodes/Neel-Nanda---Mechanistic-Interpretability-e25sibc
TOC
00:00:00 Intro
00:03:57 Discord questions
00:09:41 Chapter 1: Grokking and superposition
00:32:32 Grokking start
01:07:29 How do ML models represent their thoughts
01:20:30 Othello
01:41:29 Superposition
02:31:09 Chapter 2: Transformers discussion
02:41:06 Emergence
02:44:07 AI progress
02:57:01 Interp in the wild
03:09:26 Chapter 3: Superintelligence/XRisk
Transcript: https://docs.google.com/document/d/1FK1OepdJMrqpFK-_1Q3LQN6QLyLBvBwWW_5z8WrS1RI/edit?usp=sharing
Refs: https://docs.google.com/document/d/115dAroX0PzSduKr5F1V4CWggYcqIoSXYBhcxYktCnqY/edit?usp=sharing
See refs in pinned comment!
Interview filmed on May 31st 2023.
#artificialintelligence #machinelearning #deeplearning
https://www.youtube.com/watch?v=_Ygf0GnlwmY