MATS 10 Neel Nanda

Created
2025 Dec 20 23:20
Creator
Seonglae Cho
Edited
2026 Jan 3 3:19

Pragmatic Interpretability

The traditional "complete reverse engineering" approach has progressed very slowly. Instead of reverse engineering the entire structure, we shift toward pragmatic interpretability that directly addresses real-world safety problems.
Without feedback loops, self-deception becomes easy, so proxy tasks (measurable surrogate tasks) are essential. Even in SAE research, metrics like "reconstruction error" turned out to be nearly meaningless. Testing performance on proxies such as OOD generalization, unlearning, and hidden goal extraction instead revealed the real limitations clearly.
This is where the criticism of SAEs reappears: means often become ends. It is easy to stop at "we saw something with an SAE." Be wary of using SAEs when simpler methods would work. Does this actually help us understand the model better? Or did we just extract a lot of features?
A Pragmatic Vision for Interpretability — AI Alignment Forum
Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engi…
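Submission format: one Google Doc. Include an Executive Summary at the very top (1-3 pages, max 600 words). Graphs are required; code is not (refer to it only if needed).
Executive Summary essentials
  • The research problem and why it matters
  • The key conclusions and most interesting findings
  • For each main experiment: what you did, what you found, and why it supports the conclusion (with graphs)
Existing mech interp research may be submitted
  • A summary of an existing paper or blog post may be submitted (include the link)
  • State the estimated time spent and your own contribution
  • Explain the relevance to mech interp
  • Evaluated more strictly than a regular application
Time limit (20+2 hours)
  • Not counted: prior study, general environment setup, breaks, waiting for training runs, writing the application form
  • Counted: writing code, reading related papers, analysis and experiments, planning and thinking, writing the document
  • An additional 2 hours is allowed for the Executive Summary
Other
  • Approach the application like a mini research project: exploration → hypothesis testing → clear write-up.
  • Exploration: try many things quickly and build intuition. Speed of information gain is key.
  • Understanding: form hypotheses and convince yourself through experiments. Always watch for alternative explanations.
  • Distillation: writing up results so that others can understand them matters most. Poor writing means rejection.
  • One good insight beats several shallow experiments.
  • LLM use is actively encouraged: for learning, organizing ideas, coding assistance, and critical feedback.
  • Submitting LLM-written text as-is is discouraged; use LLMs for drafts and feedback instead.
  • Time allocation: reading ≤ 5 hours, the rest on code, experiments, and writing.
  • Listing only qualitative examples is heavily penalized; compare against a baseline wherever possible.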
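The advice to compare against a baseline rather than list qualitative examples can be made concrete with a small sketch. The function name `paired_bootstrap_diff` and the scores in the usage note are hypothetical and not taken from the application guidelines; the sketch assumes each example has one score from the method and one from the baseline on the same proxy task.

```python
import random
from statistics import mean


def paired_bootstrap_diff(method_scores, baseline_scores, n_resamples=2000, seed=0):
    """Return (mean difference, CI low, CI high) for method minus baseline.

    Uses a percentile bootstrap over per-example paired differences,
    giving an approximate 95% interval.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("scores must be paired per example")
    rng = random.Random(seed)
    # Per-example improvement of the method over the baseline.
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    # Resample the differences with replacement and record each mean.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = resampled[int(0.025 * n_resamples)]
    high = resampled[int(0.975 * n_resamples) - 1]
    return mean(diffs), low, high
```

For instance, calling `paired_bootstrap_diff([0.5, 0.7, 0.9, 0.6], [0.4, 0.7, 0.5, 0.6])` on made-up scores returns the mean improvement together with an interval; if that interval comfortably includes zero, the method has not clearly beaten the baseline on that proxy.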
Please link a Google Doc with your executive summary and main project write up. Make sure to let anyone view it! Applications without a doc will be rejected.
(Optional) Link to any other relevant outputs (code, colab, etc)
The first 1-3 pages of the attached doc are an executive summary. *
The document permissions are set so that anyone with the link can see my doc (so both Neel and his helpers can see). *
What question did you try to answer? * Max 30 words
What conclusions have you reached about this research problem? * This should look like a list of hypotheses and empirical claims you've shown (or disproven!). Max 50 words
What is the strongest evidence you found for and against these hypotheses? * Max 75 words
What are the biggest limitations to your results? Could you have addressed them? * Max 50 words. Please be honest! It's much better to flag a limitation yourself than for me to need to figure it out.
What, if any, prior experience do you have with mechanistic interpretability? *
Other than your research task, what are 1-3 pieces of evidence that you'd be able to do good research in the program? Please concisely describe them and why they're relevant. * Aim for 50-100 words, max 200. These don't have to be standard credentials! Unusual backgrounds welcome.
Why are you interested in Neel's stream specifically? *
What is the likelihood you will join Neel Nanda's training program (Feb 2 - March 6), if accepted? *
How did you use LLMs in this research task and write-up? How much did you check their work, and why? See my advice on LLM usage here.
 
 
 
Neel Nanda application (Summer 2026)

Selective acceptance of information

