MATS 10 Neel Nanda

Pragmatic Interpretability

The traditional "complete reverse engineering" approach has made very slow progress. Rather than reverse engineering a model's entire structure, the shift is toward pragmatic interpretability that directly addresses real-world safety problems.

Without feedback loops, self-deception becomes easy → proxy tasks (measurable surrogate tasks) are essential. Even in SAE research, metrics like "reconstruction error" turned out to be nearly meaningless. Testing performance on proxies such as OOD generalization, unlearning, and hidden goal extraction revealed the real limitations much more clearly.

This is where the criticism of SAEs comes up again: means often become ends. It's easy to stop at "we saw something with an SAE." Be wary of using SAEs when simpler methods would work. Does this actually help us understand the model better? Or did we just extract a lot of features?

A Pragmatic Vision for Interpretability — AI Alignment Forum

Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engi…

https://www.alignmentforum.org/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability

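Submission format: one Google Doc. Include an Executive Summary at the very front (1–3 pages, max 600 words). Graphs are required; code is not required (for reference only, if needed).

Executive Summary key points

The research problem and why it matters

Key conclusions / most interesting findings

For each main experiment: what you did, what you found, and why it supports the conclusion (include graphs)

Existing mech interp research may be submitted

A summary of an existing paper or blog post may be submitted (include a link)

State the estimated time spent and your own contribution

Must explain the relevance to mech interp

Evaluated more strictly than a regular application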

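Time limit (20+2 hours)

Not counted: prior study, general environment setup, breaks, waiting for training runs, filling in the application form

Counted: writing code, reading relevant papers, analysis/experiments, planning and thinking, writing the document

An additional 2 hours is allowed for the Executive Summary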

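Other

Approach the application like a mini research project: exploration → hypothesis testing → clear write-up.

Exploration: quickly try many things and build intuition. Speed of information gain is key.

Understanding: form hypotheses and convince yourself through experiments. Always watch out for alternative explanations.

Distillation: writing up results so that others can understand them is the most important part. Poor writing means rejection.

One good insight beats several shallow experiments.

LLM use is strongly encouraged: for learning, organizing ideas, coding assistance, and critical feedback.

Submitting LLM-written text as-is is discouraged; use it for drafts and feedback instead.

Time allocation: reading ≤ 5 hours, the rest for code, experiments, and writing.

Listing only qualitative examples is a major negative; compare against baselines wherever possible.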

Please link a Google Doc with your executive summary and main project write up. Make sure to let anyone view it! Applications without a doc will be rejected.

(Optional) Link to any other relevant outputs (code, colab, etc)

The first 1-3 pages of the attached doc are an executive summary. *

The document permissions are set so that anyone with the link can see my doc (so both Neel and his helpers can see). *

What question did you try to answer? * Max 30 words

What conclusions have you reached about this research problem? * This should look like a list of hypotheses and empirical claims you've shown (or disproven!). Max 50 words

What is the strongest evidence you found for and against these hypotheses? * Max 75 words

What are the biggest limitations to your results? Could you have addressed them? * Max 50 words. Please be honest! It's much better to flag a limitation yourself than for me to need to figure it out.

What, if any, prior experience do you have with mechanistic interpretability? *


I have written two steering papers and one study on SAE training datasets.
One steering paper proposes a method for extracting static steering vectors for general LLM tasks.
The other proposes a dynamic steering method in which a small control network observes the residual stream, framed as a Markov Decision Process.
The SAE dataset work tried to use self-generated synthetic data to reach true interpretability without depending on external web datasets.

- Static Steering https://openreview.net/forum?id=H1kO6Mncl8
- Dynamic Steering https://openreview.net/forum?id=jiPrwmMb2e
- SAE Dataset https://aclanthology.org/2025.acl-srw.20/

Other than your research task, what are 1-3 pieces of evidence that you'd be able to do good research in the program? Please concisely describe them and why they're relevant. * Aim for 50-100 words, max 200. These don't have to be standard credentials! Unusual backgrounds welcome.


My past career choices around mechanistic interpretability are important evidence.
I came to London for a master's at UCL in order to do mech interp research, and all three of my first-author papers so far are mech interp research.

Why are you interested in Neel's stream specifically? *


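I recognize Neel's influence in this domain and would like help on the mathematical side, so that I can combine it with my own ideas.
Also, because he has been exposed to so many ideas, I would like to test the quality of my own ideas and their potential impact on the field.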

What is the likelihood you will join Neel Nanda's training program (Feb 2 - March 6), if accepted? *


Low, about 10%: I currently work full time at an AI safety company and already have mech interp research experience.
I would join if it turned out to be really helpful, but what I mainly want from this program is discussion about research and feedback on my ideas.

How did you use LLMs in this research task and write-up? How much did you check their work, and why? See my advice on LLM usage here.

Neel Nanda application (Summer 2026)

https://forms.matsprogram.org/neel10

Selective information intake

Recommendations