Learn Mechanistic Interpretability

Creator
Seonglae Cho
Created
2025 Feb 6 10:57
Edited
2026 Jan 9 16:27
How To Become A Mechanistic Interpretability Researcher — LessWrong
Note: If you’ll forgive the shameless self-promotion, applications for my MATS stream are open until Sept 12. I help people write a mech interp paper…

Getting started in interpretability as an AI researcher

Interpretability with Sparse Autoencoders (Colab exercises) — LessWrong
Update (13th October 2024) - these exercises have been significantly expanded on. Now there are 2 exercise sets: the first one dives deeply into theo…
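The exercises above revolve around training sparse autoencoders (SAEs) on model activations. As a minimal sketch of the core idea (illustrative only, not the exercises' actual code; the hidden expansion factor and `l1_coeff` here are assumed toy values): an SAE encodes activations into an overcomplete, non-negative feature vector, decodes them back, and is trained on reconstruction error plus an L1 sparsity penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: d_model activations -> overcomplete sparse features -> reconstruction."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # non-negative feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(x, recon, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    # l1_coeff is an illustrative value, not a recommendation.
    return ((recon - x) ** 2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()

# Toy usage on random "activations" standing in for a residual-stream batch.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
x = torch.randn(32, 512)
recon, features = sae(x)
sae_loss(x, recon, features).backward()
```

The L1 term drives most feature activations to exactly zero, which is what makes the learned dictionary sparse and, ideally, human-interpretable.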

Reading list

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 — AI Alignment Forum
This post represents my personal hot takes, not the opinions of my team or employer. This is a massively updated version of a similar list I made two…

Chris Olah

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases
Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program. After all, neural network parameters are in some sense a binary computer program which runs on one of the exotic virtual machines we call a neural network architecture.

Neel Nanda

The field of study of reverse engineering neural networks from the learned weights down to human-interpretable algorithms. Analogous to reverse engineering a compiled program binary back to source code.
A Comprehensive Mechanistic Interpretability Explainer & Glossary - Dynalist
Concrete Steps to Get Started in Transformer Mechanistic Interpretability — Neel Nanda
Disclaimer : This post mostly links to resources I've made. I feel somewhat bad about this, sorry! Transformer MI is a pretty young and small field and there just aren't many people making educational resources tailored to it. Some links are to collations of other people's work, and I
History
The Story of Mech Interp
This is a talk I gave to my MATS scholars, with a stylised history of the field of mechanistic interpretability, as I see it (with a focus on the areas I've personally worked in, rather than claiming to be fully comprehensive). We stop at the start of sparse autoencoders, that part is coming soon!
Chapters:
00:00:00 Introduction & Scope
00:02:45 Three Core Themes
00:06:03 Grounding Research & Linearity
00:15:00 Early Vision Models
00:19:26 Feature Visualization Era
00:25:24 Interactive Tools & Adversarial Examples
00:32:00 Circuit Analysis in CNNs
00:37:42 Shift to Transformers
00:42:14 Grokking & Modular Addition
00:47:24 Causal Interventions Introduced
00:52:06 Activation Patching Method
00:58:29 Factual Recall Messiness
01:08:21 IOI Circuit Findings
01:13:20 Copy Suppression & Self-Correction
01:18:46 Backup Heads Problem
01:22:21 Superposition Challenge
01:28:00 Toy Models & Current Outlook
01:37:09 Q&A: Circuits Research Today
01:39:36 Q&A: Universality Across Models
01:48:18 Q&A: Adversarial Examples & Baselines
01:57:59 Q&A: Random Controls Matter
02:02:35 Q&A: Jailbreaks & SAE Analysis
02:08:14 Q&A: Probes & Robustness
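One technique the talk spends a while on is activation patching. As a hedged, self-contained sketch of the logic (a toy `nn.Sequential` stands in for a transformer here; a real experiment would typically use a library such as TransformerLens, but the patch-and-compare loop is the same): run the model on a clean and a corrupted input, cache the clean activations, then re-run the corrupted input with one layer's activation overwritten by its clean counterpart and check how much of the clean behaviour is restored.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer; shapes and layer count are illustrative.
model = nn.Sequential(
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 2),
)

clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)

# 1. Cache every layer's output on the clean run.
clean_cache = {}
def make_cache_hook(name):
    def hook(module, inp, out):
        clean_cache[name] = out.detach()
    return hook

handles = [layer.register_forward_hook(make_cache_hook(i)) for i, layer in enumerate(model)]
clean_logits = model(clean_input)
for h in handles:
    h.remove()

# 2. Re-run the corrupted input, overwriting one layer's output with its clean value.
#    (A forward hook that returns a tensor replaces that module's output.)
def make_patch_hook(name):
    def hook(module, inp, out):
        return clean_cache[name]
    return hook

layer_to_patch = 2  # arbitrary choice for the sketch
handle = model[layer_to_patch].register_forward_hook(make_patch_hook(layer_to_patch))
patched_logits = model(corrupted_input)
handle.remove()

# 3. Compare: how much of the clean behaviour does the patch restore?
corrupted_logits = model(corrupted_input)
print("clean:", clean_logits)
print("corrupted:", corrupted_logits)
print("patched:", patched_logits)
```

Layers where the patch moves the output back toward the clean run are candidate locations where the causally relevant information is carried.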

Research

Tips for Empirical Alignment Research — LessWrong
TLDR: I’ve collected some tips for research that I’ve given to other people and/or used myself, which have sped things up and helped put people in th…

YouTube channels

Goodfire
Our mission is to advance humanity's understanding of AI by examining the inner workings of advanced AI models (or "AI Interpretability"). As an applied research lab, we bridge the gap between theoretical science and practical applications of interpretability to build safer and more reliable AI models.
Principles of Intelligence
Principles of Intelligence (PrincInt) aims to facilitate knowledge transfer with the goal of building human-aligned AI systems. Our Fellowship (PIBBSS Fellowship) aims to draw experts from different fields and help them work on the most pressing issues in AI Risks. This channel is the repository for recorded talks, speaker events, and other materials relevant to PrincInt and PIBBSS-style research. You can learn more about us at www.princint.ai
Recommendations