Anthropic Fellowship 2025

Did you apply to Anthropic, or to work with one of our researchers via programs like MATS or ASTRA within the past year? *

If so, feel free to elaborate on any changes/additions/improvements to your application and experience since you last applied.

Please note: If your application was rejected in or after November 2024, we think it's unlikely that this program would be a good fit at this time.

I applied to MATS previously, but since then I have made notable progress in Mechanistic Interpretability, especially through hands-on research experience with sparse autoencoders (SAEs).

First, I won the 2024 Holistic AI Research Hackathon in late November, where I demonstrated a proof of concept for debiasing LLMs and red teaming bias using SAE. Building on this research, I am currently developing additional experiments for submission to the ACL 2025 demonstration track.

https://www.canva.com/design/DAGXTdRh__E/MfKS_-4Px4iXNyCT89Sh5g/edit

Secondly, inspired by Neuronpedia, I have been developing the NeuralLand app since October 2023, which offers automated feature steering. I have already generated automated interpretations for over 8,000 neurons extracted from Mistral 8b using a self-trained Mistral SAE. I plan to add features that use LLMs to recommend relevant features based on context and that automatically adjust feature activations with policy networks trained through Reinforcement Learning:

https://github.com/seonglae/neuralland

These two projects represent my biggest progress and deepest dive into Mechanistic Interpretability in the 3 months since applying to MATS. Throughout the process, I turned feasible Mechanistic Interpretability ideas into POCs and two successful projects by following up on papers from LessWrong and the mechanistic interpretability Slack.

In a paragraph or two, why are you interested in participating in this program? *

Anthropic’s Responsible Scaling Policy (RSP) and its level-based AI safety approach provide a precise research incentive to foster the emergence of safe AGI. They are “exporting seat belts” by releasing AI Safety research and spearheading Mechanistic Interpretability. Furthermore, I believe AI Interpretability offers a valuable alignment method, as it shows that neural networks’ intelligence increasingly resembles human cognitive processes.

Another reason is that interpretability research helps achieve democratic AI for both individuals and researchers. Dario Amodei’s vision of using AI for freedom and self-determination involves creating tools to evaluate and mitigate adversarial uses. Likewise, my mechanistic interpretability research uses bottom-up, theory-based methodologies rather than purely empirical approaches, offering a more principled direction for future AI safety.

How likely are you to accept a full-time offer at Anthropic if you receive an offer after the program? (Please include a % and a brief explanation) *

Anthropic's working culture, sharing the vision that we need not just to advance technology but to understand and make it safe, is fabulous. I would say 99%, with 1% being the angel's share.

If you were to start at Anthropic full-time after this program, when is the earliest you could start? (E.g., immediately after the program ends, or some other date) *

Immediately after the program ends.

How likely are you to continue being interested in working on AI safety after the program? (Please include a % and a brief explanation) *

There is a 99% chance that I will continue working on AI safety throughout my career. I am currently enrolled in a master's program focusing on AI for Sustainable Development. My AI research draws especially on Anthropic’s Transformer Circuits Thread, which has provided many insights into various aspects of AI safety, as shown in my answers regarding research interests.

In what ways are you opinionated on what you work on (if any)? (~1-3 sentences) *

Note: Our mentors for this program have a pretty strong sense of what research to prioritise. While we are open to working on a wide range of research directions in AI safety, if you feel like you already have strong takes on what to work on in ways that seem significantly different to our research priorities, then this program may not be a great fit. Please feel free to flag any uncertainties you have regarding research flexibility & fit!

Emphasis on empiricism: My industry and open-source contribution experience shows that I’m a project-driven and curiosity-driven person. When I tackle projects that are creative and distinctive with real-world applications, I find the motivation to work tirelessly, day and night. In other words, I prioritize projects with clear real-world impact and value empirical evidence over purely theoretical work, while remaining open to adapting my focus to align with broader team goals.

Please select the Fellowship research areas that you’re most interested in or excited about. *

Scalable Oversight

Adversarial Robustness and AI Control

Model Organisms

Interpretability

Feel free to elaborate on your research interests (~3-5 sentences)

I see SAEs as a promising method that can be applied to various fields, including Unlearning (Farrell, 2024). Unlearning can be the most direct method to ensure AI safety, since it removes potentially harmful or misleading knowledge from the model. I am pursuing an MSc thesis on steering LLMs by optimizing SAE features with RL to facilitate unlearning.

Mechanistic interpretability could also help optimize models via bottom-up analysis. Just as the Super Weight work (Yu, 2024) led to new quantization techniques, the “dead neurons” from SAEs (Bricken, 2023) might enable a fresh approach to model quantization.

Additionally, I have a strong interest in jailbreaking and red teaming, potentially applying RL to adversarial suffixes (Zou, 2023) to test security and safety.
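The SAE-based steering and "dead neuron" ideas above can be sketched in a few lines. This is a minimal illustration only, assuming a standard ReLU sparse autoencoder with unit-norm decoder directions; the weights are random stand-ins rather than a trained SAE, and every name here (`encode`, `steer`, `dead_features`) is hypothetical and not taken from the linked repositories.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes, chosen only for illustration

# Random stand-ins for trained SAE parameters (assumption: no real model is loaded).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d_model)


def encode(x):
    """Map residual-stream activations to non-negative SAE feature activations."""
    return np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)


def steer(x, feature_idx, target):
    """Shift x along one feature's decoder direction by (target - current activation).

    With target=0.0 this ablates the feature's contribution, which is one
    simple way to approximate unlearning-style interventions.
    """
    delta = np.asarray(target - encode(x)[..., feature_idx])
    return x + delta[..., None] * W_dec[feature_idx]


def dead_features(acts, eps=0.0):
    """Return indices of features that never fire above eps on a batch of activations."""
    return np.where(acts.max(axis=0) <= eps)[0]


x = rng.normal(size=(4, d_model))        # a toy batch of residual vectors
x_ablate = steer(x, feature_idx=3, target=0.0)
print(np.linalg.norm(x_ablate - x, axis=1))  # per-example size of the intervention
print(dead_features(encode(x)))              # features inactive on this toy batch
```

In a real setting the steered vector would be written back into the model through a forward hook, and an RL policy could choose `feature_idx` and `target`; neither step is shown here.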

References

Farrell et al. (2024) Applying Sparse Autoencoders to Unlearn Knowledge in Language Models. arXiv:2410.19278

Yu et al. (2024) The Super Weight in Large Language Models. arXiv:2411.07191

Bricken et al. (2023) Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread

Zou et al. (2023) Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043

Please share code samples from past projects (e.g. GitHub links or uploaded zip files), ideally a substantial project and (if possible) a machine learning project (these could be the same or different projects).

Repository for my first ML paper, with type annotations and well-organized ML code that enables easily reproducible, CLI-based experiments

https://github.com/seonglae/RTSum

Second ML research project, with a large code base for Open Domain Question Answering that integrates the Hugging Face platform for efficient evaluation

https://github.com/seonglae/ReSRer

Code bases related to Mechanistic Interpretability, using transformer_lens for inference and for hooking residual vectors between transformer layers, along with PyTorch hooks

https://github.com/seonglae/mistral-sae

https://github.com/seonglae/neuralland

Winning project from the Holistic AI Research Hackathon, applying SAEs to bias mitigation with sae_lens

https://github.com/seonglae/emgsd-hermes

Tiny POC repository showing SAE-based manipulation of jailbreak behavior using sae_lens

https://github.com/seonglae/jailbreak-sae

Please share at least two references that you would be alright with us reaching out to (ideally people who have worked with you in the past and would have context on your technical ability, strengths, and accomplishments). Make sure to include their name, email, and context on your relationship to them. References from the ML research community are preferred if available. *

By default, we will be contacting references in the next stage without giving you a heads up. Optionally, if you have a reason for wanting us to give you notice before reaching out to your references, please specify your reason and we're happy to respect that. After we give you notice about contacting your references, if we don't hear back from you within a week, we will email your references.

Dongha Lee (donalee@yonsei.ac.kr), Supervisor at Yonsei University

Ilsuk Park (moncher.is@kakaomobility.com), Director at Kakao Mobility & Stryx

Both references have known me for over a year and would be happy to speak about my work. However, in Korean culture, it’s best to notify them in advance, so I’d appreciate a heads-up before you contact them.

We have a designated shared workspace in London where other fellows will work from and mentors will visit. To provide the best experience for our Fellows, we will prioritise candidates who can work from these spaces. Will you be able to work out of the London workspace? If not, please elaborate on why and share where you would like to work from instead. *

I plan to reside in London and prefer to work from the London workspace. I can obtain a two-year graduate visa, so working in London poses no issues.

Would you be able to start full-time in the program in mid-March? If not, please share the earliest date you’d be able to start. *

I can work part-time until June 13 (when Term 3 of my master’s program ends). After June 13, I can commit to a full-time schedule.

Do you have any timelines or deadlines we should be aware of?

I aim to finish my MSc thesis between May and June for a conference submission; the official thesis deadline is in September.

AI Safety Level RSP driving

AI Evaluation of tooling

Not a nonprofit but a for-profit, while OpenAI is a PBC

make less promises and keep more of them

export the seat belt

intelligent optimization problem similar to human brain works by AI interpretability

AI for democracy

share the vision of the need not just to advance the technology but to understand and make safe

Building Anthropic | A conversation with our co-founders

The co-founders of Anthropic discuss the past, present, and future of Anthropic. From left to right: Chris Olah, Jack Clark, Daniela Amodei, Sam McCandlish, Tom Brown, Dario Amodei, and Jared Kaplan. Links and further reading: Anthropic's Responsible Scaling Policy (RSP): https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy Machines of Loving Grace: https://darioamodei.com/machines-of-loving-grace Work with us: https://anthropic.com/careers Claude: https://claude.com 00:00 Why work on AI? 02:08 Scaling breakthroughs 03:30 Early days of AI 10:57 Sentiment shifting 18:30 The Responsible Scaling Policy 30:42 Founding story 32:45 Building a culture of trust 39:08 Racing to the top 43:43 Looking to the future

https://www.youtube.com/watch?v=om2lIWXLLN4

Anthropic AI Safety Fellow, London

London, UK

https://boards.greenhouse.io/anthropic/jobs/4379011008?gh_src=LinkedIn

Introducing the Anthropic Fellows Program

We're launching the Anthropic Fellows Program for AI Safety Research, a pilot initiative designed to accelerate AI safety research and foster research talent. The program will provide funding and mentorship for a small cohort of 10-15 Fellows to work full-time on AI safety research. Over the course of six months, Fellows will be matched with Anthropic mentors to investigate AI safety research questions in areas such as Adversarial Robustness, Dangerous Capability Evaluations, and Scalable Oversight.

https://alignment.anthropic.com/2024/anthropic-fellows-program/
