Neel Nanda
Research Engineer, Google DeepMind - Cited by 2,018 - AI - ML - AI Alignment - Interpretability - Mechanistic Interpretability
https://scholar.google.com/citations?user=GLnX3MkAAAAJ&hl=en
Mechanistic Interpretability - NEEL NANDA (DeepMind)
http://80000hours.org/mlst
Visit our sponsor 80000 hours - grab their free career guide and check out their podcast! Use our special link above!
Support us! https://www.patreon.com/mlst
MLST Discord: https://discord.gg/aNPkGUQtc5
Twitter: https://twitter.com/MLStreetTalk
In this wide-ranging conversation, Tim Scarfe interviews Neel Nanda, a researcher at DeepMind working on mechanistic interpretability, which aims to understand the algorithms and representations learned by machine learning models. Neel discusses how models can represent their thoughts using motifs, circuits, and linear directional features, which are often communicated via the "residual stream", an information highway models use to pass information between layers.
Neel argues that "superposition", the ability for models to represent more features than they have neurons, is one of the biggest open problems in interpretability. This is because superposition thwarts our ability to understand models by decomposing them into individual units of analysis. Despite this, Neel remains optimistic that ambitious interpretability is possible, citing examples like his work reverse engineering how models do modular addition.
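As a rough illustration of the superposition idea (a toy numpy sketch, not anything from the interview): in a d-dimensional activation space you can fit far more than d nearly-orthogonal directions, so a model can give each sparse feature its own direction and tolerate a small amount of interference between them.

```python
# Toy sketch: more feature directions than dimensions, with limited overlap.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 512          # 512 candidate feature directions in 64 dims

directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

overlaps = directions @ directions.T      # pairwise cosine similarities
np.fill_diagonal(overlaps, 0.0)           # ignore each vector with itself

print(f"max |cos| between distinct directions: {np.abs(overlaps).max():.3f}")
# The maximum overlap stays well below 1 and shrinks as d grows: features
# interfere a little but remain distinguishable, unlike a strict
# one-feature-per-neuron basis, which would cap the model at d features.
```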
Key areas of discussion:
* Mechanistic interpretability aims to reverse engineer and understand the inner workings of AI systems like neural networks. It could help ensure safety and alignment.
* Neural networks seem to learn actual algorithms and processes for tasks, not just statistical correlations. This suggests interpretability may be possible.
* 'Grokking' refers to the phenomenon where neural networks suddenly generalize after initially memorizing. Understanding this transition required probing the underlying mechanisms.
* The 'superposition hypothesis' suggests neural networks represent more features than they have neurons by using non-orthogonal vectors. This poses challenges for interpretability.
* Transformers appear to implement algorithms using attention heads and other building blocks. Understanding this could enable interpreting their reasoning.
* Specific circuits like 'induction heads' seem to underlie capabilities like few-shot learning. Finding such circuits helps explain emergent phenomena.
* Causal interventions can isolate the circuits behind a behaviour. Techniques like 'activation patching' substitute activations from one run into another to test necessity and sufficiency (see the sketch after this list).
* We likely can't precisely control AI system goals now. Interpretability may reveal if systems have meaningful goal-directedness.
* Near-term risks like misuse seem more pressing than longer-term risks like recursive self-improvement. But building better understanding now enables safety work later.
* Neel thinks we shouldn't "over-philosophize". The key issue is whether AI could pose catastrophic risk, not whether it fits abstract definitions.
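To make the activation patching idea concrete, here is a hypothetical PyTorch sketch (the toy model, layer choice, and inputs are invented for illustration; real interpretability work applies the same hook-based recipe to transformer components, e.g. via libraries like Neel's TransformerLens): cache an activation from a "clean" run, splice it into a "corrupted" run, and see how much of the clean behaviour comes back.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(              # stand-in for a network with named layers
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),   # model[2] is the layer we patch
    nn.Linear(16, 2),
)
target_layer = model[2]

clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)

# 1. Run the "clean" input and cache the target layer's activation.
cache = {}
def save_hook(module, inputs, output):
    cache["clean_act"] = output.detach()

handle = target_layer.register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Run the "corrupted" input, but substitute the cached clean activation.
#    If the patched run recovers the clean behaviour, that activation is
#    (roughly) sufficient for it; patching the other way tests necessity.
def patch_hook(module, inputs, output):
    return cache["clean_act"]       # returning a value replaces the output

handle = target_layer.register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

corrupted_out = model(corrupted_input)
print("clean:    ", clean_out)
print("corrupted:", corrupted_out)
print("patched:  ", patched_out)
```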
What do YOU think? Let us know in the comments!
Neel Nanda: https://www.neelnanda.io/
https://www.youtube.com/channel/UCBMJ0D-omcRay8dh4QT0doQ
Pod version: https://podcasters.spotify.com/pod/show/machinelearningstreettalk/episodes/Neel-Nanda---Mechanistic-Interpretability-e25sibc
TOC
00:00:00 Intro
00:03:57 Discord questions
00:09:41 Chapter 1: Grokking and superposition
00:32:32 Grokking start
01:07:29 How do ML models represent their thoughts
01:20:30 Othello
01:41:29 Superposition
02:31:09 Chapter 2: Transformers discussion
02:41:06 Emergence
02:44:07 AI progress
02:57:01 Interp in the wild
03:09:26 Chapter 3: Superintelligence/XRisk
Transcript: https://docs.google.com/document/d/1FK1OepdJMrqpFK-_1Q3LQN6QLyLBvBwWW_5z8WrS1RI/edit?usp=sharing
Refs: https://docs.google.com/document/d/115dAroX0PzSduKr5F1V4CWggYcqIoSXYBhcxYktCnqY/edit?usp=sharing
See refs in pinned comment!
Interview filmed on May 31st 2023.
#artificialintelligence #machinelearning #deeplearning
https://www.youtube.com/watch?v=_Ygf0GnlwmY

The Story of Mech Interp
This is a talk I gave to my MATS scholars, with a stylised history of the field of mechanistic interpretability as I see it (with a focus on the areas I've personally worked in, rather than claiming to be fully comprehensive). We stop at the start of sparse autoencoders; that part is coming soon!
00:00:00 Introduction & Scope
00:02:45 Three Core Themes
00:06:03 Grounding Research & Linearity
00:15:00 Early Vision Models
00:19:26 Feature Visualization Era
00:25:24 Interactive Tools & Adversarial Examples
00:32:00 Circuit Analysis in CNNs
00:37:42 Shift to Transformers
00:42:14 Grokking & Modular Addition
00:47:24 Causal Interventions Introduced
00:52:06 Activation Patching Method
00:58:29 Factual Recall Messiness
01:08:21 IOI Circuit Findings
01:13:20 Copy Suppression & Self-Correction
01:18:46 Backup Heads Problem
01:22:21 Superposition Challenge
01:28:00 Toy Models & Current Outlook
01:37:09 Q&A: Circuits Research Today
01:39:36 Q&A: Universality Across Models
01:48:18 Q&A: Adversarial Examples & Baselines
01:57:59 Q&A: Random Controls Matter
02:02:35 Q&A: Jailbreaks & SAE Analysis
02:08:14 Q&A: Probes & Robustness
https://www.youtube.com/watch?v=kkfLHmujzO8

