Causal Mechanistic Interpretability (Stanford lecture 1) - Atticus Geiger
How can we use the language of causality to understand and edit the internal mechanisms of AI models?
In this guest lecture for Surya Ganguli's Stanford course APPPHYS 293, Atticus Geiger (Goodfire) shows how frameworks and tools from causal modeling can be applied to understand LLMs and other neural networks.
00:00 - Intro
01:51 - Activation steering (e.g. Golden Gate Claude)
10:23 - Causal mediation analysis (understanding the contribution of an intermediate component)
21:42 - Causal abstraction methods (explaining a complex causal system with a simple one)
26:11 - Interchange interventions
40:46 - Distributed Alignment Search
54:54 - Lookback mechanisms: a case study in designing counterfactuals
Read more about our research: https://www.goodfire.ai/research
Follow us on X: https://x.com/GoodfireAI
https://www.youtube.com/watch?v=78Xa8VkH7-g&pp=0gcJCU0KAYcqIYzv