Neel Nanda

Creator: Seonglae Cho
Created: 2024 May 21 14:10
Edited: 2024 Oct 25 22:50
 
 
 
 
Mechanistic Interpretability - NEEL NANDA (DeepMind)
http://80000hours.org/mlst - Visit our sponsor 80,000 Hours: grab their free career guide and check out their podcast using the special link above!
Support us! https://www.patreon.com/mlst
MLST Discord: https://discord.gg/aNPkGUQtc5
Twitter: https://twitter.com/MLStreetTalk

In this wide-ranging conversation, Tim Scarfe interviews Neel Nanda, a researcher at DeepMind working on mechanistic interpretability, which aims to understand the algorithms and representations learned by machine learning models. Neel discusses how models can represent their thoughts using motifs, circuits, and linear directional features, which are often communicated via a "residual stream", an information highway models use to pass information between layers (see the residual-stream sketch below). Neel argues that "superposition", the ability of models to represent more features than they have neurons, is one of the biggest open problems in interpretability, because it thwarts our ability to understand models by decomposing them into individual units of analysis. Despite this, Neel remains optimistic that ambitious interpretability is possible, citing examples like his work reverse engineering how models do modular addition (see the modular-addition sketch below).

Key areas of discussion:
* Mechanistic interpretability aims to reverse engineer and understand the inner workings of AI systems like neural networks, which could help ensure safety and alignment. Neural networks seem to learn actual algorithms and processes for tasks, not just statistical correlations, which suggests interpretability may be possible.
* "Grokking" refers to the phenomenon where neural networks suddenly generalize after initially memorizing. Understanding this transition required probing the underlying mechanisms.
* The "superposition hypothesis" suggests neural networks represent more features than they have neurons by using non-orthogonal vectors (see the superposition sketch below). This poses challenges for interpretability.
* Transformers appear to implement algorithms using attention heads and other building blocks. Understanding this could enable interpreting their reasoning.
* Specific circuits like "induction heads" seem to underlie capabilities like few-shot learning (see the induction-head sketch below). Finding such circuits helps explain emergent phenomena.
* Causal interventions can isolate model circuits. Techniques like "activation patching" substitute activations between runs to determine which components are necessary and sufficient (see the patching sketch below).
* We likely can't precisely control AI system goals now. Interpretability may reveal whether systems have meaningful goal-directedness.
* Near-term risks like misuse seem more pressing than far-future risks like recursive self-improvement, but better understanding now enables safety later.
* Neel thinks we shouldn't "over-philosophize": the key issue is whether AI could pose catastrophic risk, not whether it fits abstract definitions.

What do YOU think? Let us know in the comments!
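Illustrative sketches of the ideas above (editorial toy code, not material from the interview). First, the residual-stream picture: each attention and MLP block reads from a shared stream and adds its output back in, so later layers can use whatever earlier layers wrote. A minimal PyTorch sketch, with LayerNorm and other details omitted and all sizes chosen arbitrarily:

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """One toy transformer block that communicates only via the residual stream."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # Attention reads the stream and adds its result back in (LayerNorm omitted).
        attn_out, _ = self.attn(resid, resid, resid)
        resid = resid + attn_out
        # The MLP does the same: read, compute, add.
        return resid + self.mlp(resid)

stream = torch.randn(1, 10, 64)   # (batch, seq, d_model): the "information highway"
stream = ToyBlock()(stream)       # each block only ever adds to the stream
```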
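Superposition, as described above, is the claim that a layer can encode more sparse features than it has dimensions by giving each feature a non-orthogonal direction. A small numpy sketch of that idea; the sizes, seed, and active-feature indices are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 100, 50          # more features than "neurons"

# Each feature gets a random (nearly, but not exactly, orthogonal) unit direction.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only two of the hundred features are active.
features = np.zeros(n_features)
features[[3, 77]] = 1.0

activation = features @ directions    # compress 100 features into 50 dimensions
readout = directions @ activation     # project back onto every feature direction

print("active feature readouts:", readout[[3, 77]])                      # close to 1.0
print("worst interference:", np.abs(np.delete(readout, [3, 77])).max())  # typically much smaller
```

The price of packing features this way is the interference term, which is why superposition makes it hard to decompose a model into clean units of analysis.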
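The modular-addition example Neel cites was reverse engineered into a Fourier / trig-identity algorithm. The numpy sketch below paraphrases that algorithm rather than reproducing the trained model; the modulus 113 matches the published grokking work, but the particular frequencies are arbitrary illustrative choices:

```python
import numpy as np

p = 113                                                   # modulus used in the grokking work
freqs = 2 * np.pi * np.array([3, 14, 35, 41, 52]) / p     # illustrative "key frequencies"

def mod_add_logits(a: int, b: int) -> np.ndarray:
    cs = np.arange(p)
    logits = np.zeros(p)
    for w in freqs:
        # Trig identities build cos(w(a+b)) and sin(w(a+b)) from the per-input waves...
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # ...and the score for each candidate c is cos(w(a+b-c)), summed over frequencies.
        logits += cos_ab * np.cos(w * cs) + sin_ab * np.sin(w * cs)
    return logits                                          # constructive interference peaks at (a+b) mod p

assert mod_add_logits(27, 95).argmax() == (27 + 95) % p
```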
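The induction-head behaviour ([A][B] ... [A] then predict [B]) can be stated as a simple rule. The sketch below is that behavioural rule in plain Python; it is not the attention-head circuit itself, just the pattern the circuit implements:

```python
def induction_prediction(tokens: list[str]) -> str | None:
    """Predict the next token the way an induction head behaves:
    find the previous occurrence of the current token and copy what followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards over earlier positions
        if tokens[i] == current:
            return tokens[i + 1]               # copy the continuation seen last time
    return None                                # no earlier occurrence: nothing to copy

# Repeated patterns get completed, one mechanism behind in-context / few-shot learning:
print(induction_prediction(["Mr", "Dursley", "was", "proud", "of", "Mr"]))  # -> "Dursley"
```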
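Activation patching, as described in the discussion points, swaps an activation recorded on a "clean" run into a "corrupted" run and checks how much of the clean behaviour comes back. A hedged sketch using Neel Nanda's TransformerLens library; the prompts, layer, and position are arbitrary choices for illustration, and the general pattern is run_with_cache to record clean activations, then run_with_hooks to overwrite them in the corrupted run:

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-small")

clean   = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

_, clean_cache = model.run_with_cache(clean)   # record every activation on the clean run

LAYER, POS = 6, 9                              # which residual-stream activation to patch (illustrative)
hook_name = utils.get_act_name("resid_pre", LAYER)

def patch_hook(resid, hook):
    # Overwrite one position of the corrupted residual stream with its clean value.
    resid[:, POS, :] = clean_cache[hook.name][:, POS, :]
    return resid

patched_logits = model.run_with_hooks(corrupt, fwd_hooks=[(hook_name, patch_hook)])

# If this activation carries the task-relevant information, patching it should push
# the corrupted run back towards the clean answer (" Mary" rather than " John").
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
print((patched_logits[0, -1, mary] - patched_logits[0, -1, john]).item())
```

Sweeping this patch over every layer and position is what localises which activations are necessary and sufficient for the behaviour.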
Neel Nanda: https://www.neelnanda.io/
https://www.youtube.com/channel/UCBMJ0D-omcRay8dh4QT0doQ
Pod version: https://podcasters.spotify.com/pod/show/machinelearningstreettalk/episodes/Neel-Nanda---Mechanistic-Interpretability-e25sibc

TOC:
00:00:00 Intro
00:03:57 Discord questions
00:09:41 Chapter 1: Grokking and superposition
00:32:32 Grokking start
01:07:29 How do ML models represent their thoughts
01:20:30 Othello
01:41:29 Superposition
02:31:09 Chapter 2: Transformers discussion
02:41:06 Emergence
02:44:07 AI progress
02:57:01 Interp in the wild
03:09:26 Chapter 3: Superintelligence/XRisk

Transcript: https://docs.google.com/document/d/1FK1OepdJMrqpFK-_1Q3LQN6QLyLBvBwWW_5z8WrS1RI/edit?usp=sharing
Refs: https://docs.google.com/document/d/115dAroX0PzSduKr5F1V4CWggYcqIoSXYBhcxYktCnqY/edit?usp=sharing
See refs in pinned comment!
Interview filmed on May 31st 2023.
#artificialintelligence #machinelearning #deeplearning
 
 
