MATS 2025 (MATS 8.0)

Candidate

MATS 2025s

Name | Description | London | MechInterp | Priority | Control
Fernando Rosas | RL with represent state | | | |
Lee Sharkey | SAE MLE | | | |
Alex Turner, Alex Cloud | | | | |
Mary Phuong | CoT scheming in person London | | | 1.5 |
Ethan Perez, Buck Shlegeris, Samuel Marks, Joe Benton, Evan Hubinger, Mrinank Sharma, Fabien Roger, Kyle Fish, Stephen McAleer, Nicholas Carlini | Anthropic Redwood, project proposal | | | 1.5 |
Marius Hobbhahn | LM agent scheming black-box monitor that works in general settings with large action spaces (same project, fulltime) | | | 1.5 |
Adam Shai, Paul Riechers | Simplex residual stream | | | 1.5 |
Hidenori Tanaka | cognitive alignment with rigorous mathematical framework | | | 1.5 |
Dawn Song, Yiyou Sun, Xuandong Zhao | AI security and Representation with Donghyeon Lee | | | |
Samuel Albanie | AI control by prompting and role playing | | | |
Tomek Korbak | AI agent scheming | | | |
Oliver Sourbut (Oly), Sid Black | AI system eval | | | |

Neel Nanda Stream

What problem am I trying to solve? (and a bit on why you think it’s interesting)

Remember that what you write is always clearer to yourself than to the reader

What are your high-level takeaways? What were the most interesting parts of your project?

One paragraph and graph per key experiment, giving the gist of what it was, what you found, and why this supports your key takeaways.

bullet points, regular summaries, intuitive explanations, concrete examples, pseudocode (or actual code), etc. Good graphs and figures go a long way.

the key thing is for me to “get” the high-level picture of what you did, why, and what you learned, and this is more important than conveying every detail

Start prompt

What's your experience with AI safety? *

Completing AI Safety Fundamentals, MLAB, MLSS, PIBBSS, AI Safety Camp, etc.; - Reading/posting on LessWrong; - Reading The Precipice, The Alignment Problem, Human Compatible, Life 3.0, Superintelligence, etc.; If you have any projects or posts, please provide links when able. If you have no experience with AI safety research in particular, please feel free to leave this section blank.

Projects

My latest research examined how different SAE training conditions affect the latent dictionary. I highlighted the importance of the training dataset for feature set transferability and proposed a position feature set in the early layers. This post demonstrates two of my strengths: a detailed understanding of SAE dynamics and effective visualization.

https://www.lesswrong.com/posts/ATsvzF77ZsfWzyTak/dataset-sensitivity-in-feature-matching-and-a-hypothesis-on-1

Another project involved winning the Holistic UCL AI Research Hackathon 2024. We developed a full pipeline by extracting SAE latents correlated with a biased classification dataset and then steering the LLM. By identifying the steering vector in those features, we showed how bias could be either mitigated or amplified. I am currently improving this method by integrating SAE-TS with a gradient-based cleanup approach for AxBench control, aiming for a paper publication.

https://github.com/seonglae/emgsd-hermes

https://www.canva.com/design/DAGXTdRh__E/MfKS_-4Px4iXNyCT89Sh5g/edit
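The latent-selection step of the pipeline described above could be sketched roughly as follows. This is a minimal illustration, not the hackathon code: it assumes per-example SAE latent activations are already available as rows of floats, it uses Pearson correlation with binary bias labels as the selection score (an assumption), and the names `pearson` and `top_correlated_latents` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def top_correlated_latents(latent_acts, labels, k):
    """Rank SAE latents by |correlation| with the labels.

    latent_acts[i][j] is the activation of latent j on example i (assumed layout).
    Returns the k best (latent_index, correlation) pairs.
    """
    n_latents = len(latent_acts[0])
    scores = [pearson([row[j] for row in latent_acts], labels) for j in range(n_latents)]
    order = sorted(range(n_latents), key=lambda j: abs(scores[j]), reverse=True)
    return [(j, scores[j]) for j in order[:k]]

# Toy data: latent 0 fires on biased examples, latent 1 on unbiased ones, latent 2 is constant.
acts = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.5], [0.0, 1.0, 0.5], [0.1, 0.8, 0.5]]
labels = [1, 1, 0, 0]
selected = top_correlated_latents(acts, labels, k=2)
```

The sign of each correlation indicates whether steering along that latent would be expected to amplify or mitigate the bias; the steering step itself is omitted here.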

Currently, I am pursuing multiple lines of research on mechanistic interpretability of SAEs and investigating model transferability and fine-tuning dynamics.

Literature

By following articles from the Anthropic Circuit Thread, OpenAI Distill circuit, and LessWrong, I wrote two Medium posts discussing SAEs, the Superposition Hypothesis, and Compressed Sensing. After exploring additional sources, I’m now optimistic about the Natural Abstraction Hypothesis, the Universality Hypothesis, and Internal Interface Theory, and I hope to substantiate these in the future.

https://medium.com/p/13cbf8a2f984

https://medium.com/p/c07b74d23e96

What previous research experience do you have? *

Prior research experience in fields like ML/CS/math/philosophy may play a role in your ability to do alignment research. That said, if you have done any research in any field that you are proud of, we would also like to hear about that. Feel free to refer to your LinkedIn/resume.

Before shifting to mechanistic interpretability, I worked on two ML research projects. One aimed to improve Open-Domain Question Answering (ODQA) by increasing information density in context passages through an AI-driven summarization approach. I designed a recursive summarization technique and explored contextual compression techniques such as activation-beacon methods, which led me to pivot to mechanistic interpretability.

https://github.com/seonglae/ReSRer

My first AI research was published as a demo track paper at NAACL 2024. It proposed a summarization method that preserves factual accuracy by decomposing sentences into relation triples (subject, relation, object) and applying cosine similarity scores within a multi-level graph (sentence, relation, phrase). By recombining the key facts in the summary, it visualized the most important information and reduced hallucination.

https://github.com/seonglae/RTSum

https://aclanthology.org/2024.naacl-demo.5/
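The triple-scoring idea above can be illustrated with a small sketch. It is not the RTSum implementation: bag-of-words counts stand in for learned embeddings, the multi-level graph is collapsed to a single triple-versus-document comparison, and the function names `bow`, `cosine`, and `rank_triples` are hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as token -> count (assumed stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_triples(triples, document):
    """Score each (subject, relation, object) triple against the document, best first."""
    doc_vec = bow(document)
    scored = [(cosine(bow(" ".join(t)), doc_vec), t) for t in triples]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

triples = [("cat", "sat on", "mat"), ("dog", "chased", "ball")]
ranked = rank_triples(triples, "the cat sat on the mat")
```

The highest-scoring triples would then be recombined into summary sentences, a step not shown here.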

1. Hidenori Tanaka stream application (MATS 8.0)

This question was added by the MATS team due to the strong signal it has provided some mentors in the past.

What are 1-3 pieces of evidence that you'd be able to do good research in this stream? (These don't have to be standard credentials!) Please concisely describe them and why they're relevant. Aim for 50-100 words, max 300

My latest work examined how different training conditions in Sparse Autoencoders affect the latent dictionary, employing an explicit mathematical formulation applied directly to the SAE weights (https://www.lesswrong.com/posts/ATsvzF77ZsfWzyTak/dataset-sensitivity-in-feature-matching-and-a-hypothesis-on-1).

With a strong engineering foundation built from three years of experience as a Software Engineer, I have published a demonstration paper at NAACL 2024 on interpretability research and led a project that won the Holistic UCL AI Research Hackathon 2024. During the hackathon, I developed a complete pipeline to extract SAE latent representations by correlating them with a text classification dataset to guide LLM behavior.

Grounded in the Natural Abstraction Hypothesis (Wentworth, 2021), I believe that LLM representations encode cognitive beliefs. My current research focuses on discovering neural circuits (Conmy, 2023) within hidden states to better understand how these models make decisions and exhibit emergent reasoning abilities. Additionally, the cognitive psychology classes I took at Yonsei University and the computational neuroscience modules from UCL Gatsby that I audited during my master's have been instrumental in shaping my research perspective.

2. Fernando Rosas stream application (MATS 8.0)

Which topic(s) would you be excited to work on as part of this stream and why? Feel free to select one of the example project ideas provided or propose your own. [150 words max] *

Recently, Anthropic employed a cross-layer transcoder to discover circuits in computational graphs (Ameisen et al., 2025). Circuits are the ultimate construct in interpretability, offering a window into the causal structure behind decision-making. However, while existing research analyzes fully trained circuits, it overlooks how emergent reasoning develops during fine-tuning. I am excited to address this gap by applying RL to LLMs. My goal is to isolate and identify the circuits that emerge during fine-tuning, such as those underlying chat or logical reasoning capabilities.

What do you want to work on? Please give a 1-2 paragraph pitch for your research idea that fits this stream. *

DeepSeek-R1 demonstrated that reinforcement learning can dramatically enhance an LLM’s general reasoning capability, suggesting that latent abilities emerge during inference through RL-driven training. Although current RL techniques for LLMs are typically restricted to single-step updates, there is untapped potential in leveraging the sequential nature of transformer representations. My proposal is to apply RL directly to sequences of hidden states, treating these as dynamic observations. By integrating a control model with PPO for benchmarks like MMLU, we can precisely track the evolution of neural circuits. The policy model will selectively adjust one latent dimension per token, identified via SAE, to reconstruct and optimize the circuit underlying reasoning. Through RL-based training, the objective is to uncover a universal and interpretable circuit that governs dynamics during fine-tuning, providing insights into both emergent behavior and task-specific adaptivity.
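The per-token intervention in this proposal could look roughly like the sketch below. It covers only the action step, not PPO training, and rests on assumptions not stated above: a standard ReLU SAE with one decoder row per latent, an action consisting of a latent index and a scalar shift, and hypothetical names (`sae_encode`, `sae_decode`, `apply_latent_action`).

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_encode(h, W_enc, b_enc):
    """SAE latents f = ReLU(W_enc h + b_enc)."""
    return [max(0.0, z + b) for z, b in zip(matvec(W_enc, h), b_enc)]

def sae_decode(f, W_dec, b_dec):
    """Reconstruction: b_dec plus the latent-weighted sum of decoder rows."""
    return [b_dec[j] + sum(f[i] * W_dec[i][j] for i in range(len(f)))
            for j in range(len(b_dec))]

def apply_latent_action(h, W_dec, latent_idx, delta):
    """Shift the hidden state along one latent's decoder direction.

    Equivalent to decode(f + delta * e_idx) - decode(f) added to h, so the
    part of h the SAE fails to reconstruct is left untouched.
    """
    return [hj + delta * wj for hj, wj in zip(h, W_dec[latent_idx])]

# Toy SAE: 2-dimensional hidden state, 3 latents.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

h = [1.0, 2.0]
f = sae_encode(h, W_enc, b_enc)
h_new = apply_latent_action(h, W_dec, latent_idx=2, delta=0.5)
```

In an RL setup of this kind, the policy would observe `f` (or `h`) for each token and emit `(latent_idx, delta)`, with the benchmark score providing the reward.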

What’s your motivation for applying to this stream and project? (max 100 words) *

I am driven by the promise of mechanistic interpretability to reveal the fundamental principles behind AI intelligence rather than merely demystifying black-box models. I firmly believe that advances in this field will fundamentally shape the future of AI. Moreover, exploring RL and test-time interpretability offers an opportunity to unmask latent reasoning processes. This stream aligns with my passion for formalizing neural circuit dynamics in emergent reasoning.

3. Mary Phuong stream application (MATS 8.0)

Pick an example project from the ones listed in the stream description and projects, or propose your own in the same general research area. Write a concrete outline of how you would tackle the project in 150-250 words. What are the riskiest parts of the plan? What would you prioritise doing in the first week? Please be as concrete and specific as possible. (To clarify, you are not committing to a particular project here, it's just an exercise.) *

Current safety alignment approaches, such as Deliberative Alignment (Guan et al., 2024), primarily audit only the final output of chain-of-thought processes. However, emerging evidence from Chen et al. (2025) indicates that reasoning models do not consistently articulate their internal deliberations. To advance toward Deep Safety Alignment (Qi et al., 2024), it is critical to comprehensively audit the entire chain-of-thought process by monitoring the internal vector representations that govern the model’s safety behavior.

I propose integrating the Steering Vector concept, which directly manipulates LLM representations, into the Deliberative Alignment framework. First, we extract steering vectors via contrastive activation additions on a selected contrastive dataset. With these activations, a KL divergence loss is incorporated during the SFT training step. This straightforward modification is expected to produce outputs that align more closely with safety benchmarks by optimizing internal reasoning processes directly. The principal risk is ensuring that the extracted steering vectors precisely capture the intended internal states without introducing bias.
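A minimal sketch of the two ingredients named above, under stated assumptions: the steering vector is taken as the difference of mean activations between the two halves of the contrastive dataset, and the KL term compares the next-token distribution of a steered forward pass with that of the model being fine-tuned. How the KL loss is wired into SFT is not specified in the proposal, so `combined_loss` and its weight `beta` are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between positive and negative prompts."""
    mean_pos, mean_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def combined_loss(sft_loss, steered_logits, model_logits, beta=0.1):
    """SFT loss plus a weighted KL term pulling the model toward the steered distribution."""
    return sft_loss + beta * kl_divergence(softmax(steered_logits), softmax(model_logits))

# Toy activations at one layer for safe (positive) and unsafe (negative) prompts.
vec = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
loss_same = combined_loss(1.0, [0.0, 0.0], [0.0, 0.0], beta=0.5)
loss_diff = combined_loss(1.0, [2.0, 0.0], [0.0, 0.0], beta=0.5)
```

The KL term vanishes when the fine-tuned model already matches the steered distribution, so it only penalizes deviations from the behavior the steering vector induces.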

[≤ 200 words; answering this is optional but recommended] What's an AI safety paper/post you're excited about? (doesn't need to be your own) Link it, explain why you're excited about it, and describe at least one weakness or limitation you still see. Focus on explaining your view, not summarizing the paper. *

“Towards Monosemanticity” from Bricken et al. (2023) remains my favorite, grounded in the Superposition Hypothesis (Elhage et al., 2022) and Induction Head (Elhage et al., 2021; Olsson et al., 2022) from the same Circuit Thread. I appreciate starting with a mathematical framework, as it solidifies the approach, allowing me the freedom to creatively formulate patterns and predict rules. Observing the generation of the Induction Head and phase change through feature dimensionality, while linking it to in-context learning, feels like an ideal research direction I’m excited to pursue.

Nonetheless, structural interpretability is challenged by dynamic compute-time representations. My initial work with Sparse Autoencoders opened the door to understanding these dynamics; however, SAEs are heavily dataset-dependent and exhibit limitations in feature validation. Recent advancements from Anthropic, namely cross-layer transcoders in Circuit Tracing (Ameisen et al., 2025; Lindsey et al., 2025), offer promising circuit discovery, but they still face issues with interpretability, complexity, and generalizability. My goal is to bridge test-time and architectural interpretability to enhance model monitoring and control for downstream AI safety tasks.

Would you be available for in-person collaboration in London for the duration of the program? *

Yes, I am finishing my MSc program at UCL, and I'm happy to collaborate in person by forming a team in London.

How much time do you anticipate being able to spend on your MATS project for the duration of the program? *

The program runs from June 16 to August 22. I anticipate working part-time in July and being available full-time during August.

If the project is going well, and you receive funding to continue your work afterward, do you expect to choose to participate in the MATS extension (between 6 and 12 months after the main program)? This could be remote. It’s okay if you aren’t sure if you’ll be available for the entire period. *

I plan to continue my research in AI interpretability and safety without interruption. If my work is well received and the MATS extension proceeds, I would ideally prefer to collaborate in person in London in partnership with DeepMind.

4. Marius Hobbhahn

What’s your motivation for applying to this stream and project? (max 100 words) *

AI scheming is an inevitable challenge as model performance increases. By analogy to human society, where crime often results from elaborate scheming, we can readily anticipate similar risks in advanced AI systems. I propose to address this by extending alignment techniques to explicitly incorporate internal representations. I believe Apollo Research, a leading institute, and Dr. Marius Hobbhahn would be ideal advisors to support this effort, and I am excited to present my project idea.

Why do you think you’d be a good fit for the stream and project? This can be based on prior experience, personal preferences, career aims, or anything else you personally think makes you a good fit. (max 100 words) *

Write a project proposal for the scheming monitoring project (https://docs.google.com/document/d/1FDgh4ioygjqVionP3i8i163zaXT9hqAV-gEikCNhlTE/edit?usp=sharing). The project proposal should be maximally one page (ca. 400 words) excluding references. I expect that the best applications will spend 3-5 hours on this and do a brief empirical investigation in addition to writing. Figures and empirical results are welcome. I encourage you to write it in a Google Doc and copy-paste the full link in the response (make sure the sharing setting is correct and copy-paste the full link since links on text do not work; be careful because links that include the character “_” twice will make stuff italic). If you don’t have the time for a detailed proposal, feel free to write a 100 word description of your proposal. *

black‐box monitoring of scheming

crisp week‑1 experimental plan

evidence question last one

Current safety alignment approaches, such as Deliberative Alignment (Guan et al., 2024), primarily audit only the final output of chain-of-thought processes. Moreover, emerging evidence from Chen et al. (2025) indicates that reasoning models do not consistently articulate their internal deliberations or their scheming capabilities. To advance toward Deep Safety Alignment (Qi et al., 2024), it is critical to comprehensively audit the entire chain-of-thought process by monitoring the internal vector representations that govern the model’s safety behavior.

I propose integrating the Steering Vector concept, which directly manipulates LLM representations, into the Deliberative Alignment framework. First, we extract steering vectors via contrastive activation additions on a selected contrastive dataset. With these activations, a KL divergence loss is incorporated during the SFT training step. The principal risk is ensuring that the extracted steering vectors precisely capture the intended internal states without introducing bias. Overall, this straightforward modification, which considers internal representations, is expected to produce outputs that align more closely with safety benchmarks by directly optimizing intermediate reasoning processes.

5. Lee Sharkey stream application (MATS 8.0)

What do you want to work on? Please give a 3-5 paragraph pitch for your research idea that fits this stream. *

My starting point is “Towards Monosemanticity” (Bricken et al., 2023), which is grounded in the Superposition Hypothesis (Elhage et al., 2022) and Induction Head (Elhage et al., 2021; Olsson et al., 2022) from the same Circuit Thread. Starting with a mathematical framework solidifies the approach and allows me the freedom to creatively formulate patterns and predict rules. I am interested in observing the generation of the Induction Head and the phase change through feature dimensionality, while linking it to in-context learning.

Nonetheless, structural interpretability is challenged by dynamic compute-time representations. Sparse Autoencoders opened the door to understanding these dynamics; however, SAEs are heavily dataset-dependent and exhibit limitations in feature validation. Recent advancements from Anthropic, namely cross-layer transcoders in Circuit Tracing (Ameisen et al., 2025; Lindsey et al., 2025), offer promising circuit discovery, but they still face issues with interpretability, complexity, and generalizability. We may bridge test-time and architectural interpretability to enhance model monitoring and control for downstream AI safety tasks.

I propose integrating the Steering Vector concept, which directly manipulates LLM representations, into the Deliberative Alignment framework. First, we extract steering vectors via contrastive activation additions on a selected contrastive dataset. With these activations, a KL divergence loss is incorporated during the SFT training step. The principal risk is ensuring that the extracted steering vectors precisely capture the intended internal states without introducing bias. Overall, this straightforward modification, which considers internal representations, is expected to produce outputs that align more closely with safety benchmarks by directly optimizing intermediate reasoning processes.

What’s your motivation for applying to this stream and project? (max 100 words) *

6. Alex Turner & Alex Cloud stream application (MATS 8.0)

Explain your strongest disagreement with other alignment thinkers. Consider only your own inside-view understanding, and don't defer to others' expertise. *

Propose a follow-up experiment to section 4.2 of the gradient routing paper (https://arxiv.org/abs/2410.04332) and explain the relevance of the experiment to AI safety efforts. (Suggested length: ~250 words for experiment, 1-4 sentences for relevance.) *

Briefly describe your most relevant skills and experience other than AI research (e.g. software engineering, other research, teamwork, writing, or anything else) (max 250 words). *

Are you open to working in a small team? We plan to prioritize people that answer “yes” to this question, although it is not a strict requirement. *

If you’re open to working in a team, briefly reflect on how you might make this go well. What would a great collaboration look like to you? In your answer, you might draw on prior experience in teams, but you don’t have to. (max 250 words).

If accepted, would you participate in the program full-time, in person in Berkeley (Jun 16 - Aug 22)? If you’re not sure, please explain. We are happy to accept people who have other commitments, as long as MATS would be their primary focus. *

Is there anything else we should know? (If the rest of your application speaks for itself, feel free to leave this blank.)

Apply to LASR Labs Summer 2025

Why are you a good fit for LASR Labs? *

I am currently working on my thesis, training a reinforcement‑learning SAE‑based control model that raised Gemma‑2B’s MMLU from 51.9% to 54.6%. I intend to extend this work to extract circuits under varied training regimes. My latest study analysed how training conditions affect Sparse Autoencoder dictionaries, using an explicit mathematical formulation applied to SAE weights (https://www.lesswrong.com/posts/ATsvzF77ZsfWzyTak/dataset-sensitivity-in-feature-matching-and-a-hypothesis-on-1)

Please share 1-3 things you’ve read that have affected your thoughts on loss of control or catastrophic risks related to AI. Please write one sentence for each summarising how they impacted your views. *

“Reasoning Models Don’t Always Say What They Think” confirmed that proper AI auditing requires monitoring internal representations to maintain control.

The Natural Abstraction Hypothesis (Wentworth, 2021) and Shard Theory (Udell, 2022) gave me the insight that achieving AI safety requires neuroscientific understanding through decomposition analysis.

“A Mathematical Framework for Transformer Circuits” taught me that approximate mathematical modeling of neural networks enables the extraction of interpretable units.

Please share a research or engineering project you've worked on that you're particularly proud of and explain your contribution. *

My most recent unpublished project, Faithful SAE, addresses sparse autoencoder limitations by prompting LLMs to self‑generate training datasets that reflect their intrinsic capabilities. This approach enables SAEs to learn robust feature dictionaries directly from model‑driven examples, reducing reliance on external corpora. I implemented and trained 40+ SAEs across architectures including Gemma-2B, LLaMA, and Pythia, following the SAE scaling law (Gao, 2024) and handling synthetic datasets of 100M+ samples.

The following questions are logistical questions.

Anything else you would like to add? Or any questions that you have for us?

What makes you a good fit for Cohere?

RAG experience using various sources and MCP product development experience.

Extensive NLP-based service development experience, including AI agent-driven services and detailed customization by forking the Open Deep Research AI Agent.

Proficient in using Python and TypeScript appropriately to develop backend, frontend, and AI algorithms.

Experience as an AI Research Engineer Intern in London at Holistic AI, enabling rapid adaptation

Available to start an internship in Fall 2025


Recommendations