Did you apply to Anthropic, or to work with one of our researchers via programs like MATS or ASTRA within the past year? *
If so, feel free to elaborate on any changes/additions/improvements to your application and experience since you last applied.
Please note: If your application was rejected in or after November 2024, we think it's unlikely that this program would be a good fit at this time.
I previously applied to MATS. Since then, I have made notable progress in Mechanistic Interpretability, particularly through hands-on research with Sparse Autoencoders (SAEs).
First, I won the 2024 Holistic AI Research Hackathon in late November, where I demonstrated a proof of concept for debiasing LLMs and red teaming bias using SAE. Building on this research, I am currently developing additional experiments for submission to the ACL 2025 demonstration track.
Secondly, inspired by Neuronpedia, I have been developing the NeuralLand app since October 2023, which offers automated feature steering. I have already provided automated interpretability for over 8,000 neurons by extracting Mistral 8b using a self-trained Mistral SAE. I plan to add features that use LLMs to recommend relevant features based on context and automatically adjust feature activation with policy networks through Reinforcement Learning.
These two projects represent my biggest progress in, and deepest dive into, Mechanistic Interpretability in the 3 months since applying to MATS. Along the way, I developed feasible Mechanistic Interpretability ideas into POCs and two successful projects by following up on papers from LessWrong and the mechanistic interpretability Slack.
In a paragraph or two, why are you interested in participating in this program? *
Anthropic’s Responsible Scaling Policy (RSP) and its level-based AI safety approach provide a precise research incentive to foster the emergence of safe AGI. They are “exporting seat belts” by releasing AI Safety research and spearheading Mechanistic Interpretability. Furthermore, I believe AI Interpretability offers a valuable alignment method, as it shows that neural networks’ intelligence increasingly resembles human cognitive processes.
Another reason is that interpretability research helps achieve democratic AI for both individuals and researchers. Dario Amodei’s vision to use AI for freedom and self-determination involves creating tools to evaluate and mitigate adversarial usages. Likewise, my mechanistic interpretability research uses bottom-up, theory-based methodologies rather than purely empirical approaches, offering a more principled direction for future AI safety.
How likely are you to accept a full-time offer at Anthropic if you receive an offer after the program? (Please include a % and a brief explanation) *
Anthropic's working culture, sharing the vision that we need not just to advance technology but to understand and make it safe, is fabulous. I would say 99%, with 1% being the angel's share.
If you were to start at Anthropic full-time after this program, when is the earliest you could start? (E.g., immediately after the program ends, or some other date) *
Immediately after the program ends.
How likely are you to continue being interested in working on AI safety after the program? (Please include a % and a brief explanation) *
There is a 99% chance that I will continue working on AI safety throughout my career. I am currently enrolled in a master's program focusing on AI for Sustainable Development. My main resources for AI research, especially Anthropic’s Circuit Thread, have provided many insights into various aspects of AI safety, as shown in my answers regarding research interests.
In what ways are you opinionated on what you work on (if any)? (~1-3 sentences) *
Note: Our mentors for this program have a pretty strong sense of what research to prioritise. While we are open to working on a wide range of research directions in AI safety, if you feel like you already have strong takes on what to work on in ways that seem significantly different to our research priorities, then this program may not be a great fit. Please feel free to flag any uncertainties you have regarding research flexibility & fit!
Emphasis on empiricism: My industry and open-source contribution experience shows that I am a project-driven and curiosity-driven person. When I tackle creative, distinctive projects with real-world applications, I find the motivation to work tirelessly, day and night. In other words, I prioritize projects with clear real-world impact and value empirical evidence over purely theoretical work, while remaining open to adapting my focus to align with broader team goals.
Please select the Fellowship research areas that you’re most interested in or excited about. *
Scalable Oversight
Adversarial Robustness and AI Control
Model Organisms
Interpretability
Feel free to elaborate on your research interests (~3-5 sentences)
I see SAE as a promising method that can be applied to various fields, including Unlearning (Farrell, 2024). Unlearning can be the most direct method to ensure AI safety by removing potentially harmful or misleading data. I am pursuing an MSc thesis on steering LLMs by optimizing SAE features with RL to facilitate unlearning.
Mechanistic interpretability could also help optimize models via bottom-up analysis. Just as the Super Weight work (Yu, 2024) introduced new quantization techniques, the “dead neurons” from SAE (Bricken, 2023) might enable a fresh approach to model quantization.
Additionally, jailbreaking and red teaming are areas of strong interest, potentially applying RL-based adversarial suffixes (Zou, 2023) to test security and safety.
References
- Farrell et al. (2024) Applying Sparse Autoencoders to Unlearn Knowledge in Language Models. arXiv:2410.19278
- Yu et al. (2024) The Super Weight in Large Language Models. arXiv:2411.07191
- Bricken et al. (2023) Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread
- Zou et al. (2023) Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043
Please share code samples from past projects (e.g. GitHub links or uploaded zip files), ideally a substantial project and (if possible) a machine learning project (these could be the same or different projects).
- My first ML paper repository with typing and well-managed ML code enabling seamless CLI-based experiments that are easily reproducible
- Second ML research with large code bases for Open Domain Question Answering integrating huggingface platform efficiently for evaluation
- Code bases related to Mechanistic Interpretability with transformer_lens for inference and hooking residual vectors among transformer layers along with PyTorch hooks
- Holistic AI research hackathon winning project treating SAE and bias mitigation with sae_lens
- Tiny POC repository to show SAE manipulating Jailbreak using sae_lens
Other links to information about you and your technical contributions (e.g., personal website, LinkedIn, GitHub, Google Scholar, and/or blog posts).
- LinkedIn: https://www.linkedin.com/in/seonglae
- GitHub: https://github.com/seonglae
- Google Scholar: https://scholar.google.com/citations?user=XIMB1PoAAAAJ
- Hugging Face: https://huggingface.com/seonglae
Two insightful blog posts about each:
- Superposition Hypothesis and Sparse AutoEncoder, providing visions like Unlearning
- Connecting concepts and discovering consistency across Anthropic’s papers on Phase Change & Feature Dimensionality to explain emergent ability and in-context learning
Please share at least two references that you would be alright with us reaching out to (ideally people who have worked with you in the past and would have context on your technical ability, strengths, and accomplishments). Make sure to include their name, email, and context on your relationship to them. References from the ML research community are preferred if available. *
By default, we will be contacting references in the next stage without giving you a heads up. Optionally, if you have a reason for wanting us to give you notice before reaching out to your references, please specify your reason and we're happy to respect that. After we give you notice about contacting your references, if we don't hear back from you within a week, we will email your references.
- Dongha Lee (donalee@yonsei.ac.kr), Supervisor at Yonsei University
- Ilsuk Park (moncher.is@kakaomobility.com), Director at Kakao Mobility & Stryx
Both references have known me for over a year and would be happy to speak about my work. However, in Korean culture, it’s best to notify them in advance, so I’d appreciate a heads-up before you contact them.
We have a designated shared workspace in London where other fellows will work from and mentors will visit. To provide the best experience for our Fellows, we will prioritise candidates who can work from these spaces. Will you be able to work out of the London workspace? If not, please elaborate on why and share where you would like to work from instead. *
I plan to reside in London and prefer to work from the London workspace. I can obtain a two-year graduate visa, so working in London poses no issues.
Would you be able to start full-time in the program in mid-March? If not, please share the earliest date you’d be able to start. *
I can work part-time until June 13 (when Term 3 of my master’s program ends). After June 13, I can commit to a full-time schedule.
Do you have any timelines or deadlines we should be aware of?
I am supposed to finish my MSc thesis between May and June for a conference submission, while the official thesis deadline is in September.
- AI Safety Level RSP driving
- AI Evaluation of tooling
- Not a nonprofit; it is for-profit, while OpenAI is PBC
- make less promises and keep more of them
- export the seat belt
- AI interpretability suggests intelligence is an optimization problem similar to how the human brain works
- AI for democracy
- share the vision of the need not just to advance the technology but to understand and make safe
Building Anthropic | A conversation with our co-founders
The co-founders of Anthropic discuss the past, present, and future of Anthropic. From left to right: Chris Olah, Jack Clark, Daniela Amodei, Sam McCandlish, Tom Brown, Dario Amodei, and Jared Kaplan.
Links and further reading:
Anthropic's Responsible Scaling Policy (RSP): https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy
Machines of Loving Grace: https://darioamodei.com/machines-of-loving-grace
Work with us: https://anthropic.com/careers
Claude: https://claude.com
00:00 Why work on AI?
02:08 Scaling breakthroughs
03:30 Early days of AI
10:57 Sentiment shifting
18:30 The Responsible Scaling Policy
30:42 Founding story
32:45 Building a culture of trust
39:08 Racing to the top
43:43 Looking to the future
https://www.youtube.com/watch?v=om2lIWXLLN4

Anthropic AI Safety Fellow, London
London, UK
https://boards.greenhouse.io/anthropic/jobs/4379011008?gh_src=LinkedIn

Introducing the Anthropic Fellows Program
We're launching the Anthropic Fellows Program for AI Safety Research, a pilot initiative designed to accelerate AI safety research and foster research talent. The program will provide funding and mentorship for a small cohort of 10-15 Fellows to work full-time on AI safety research. Over the course of six months, Fellows will be matched with Anthropic mentors to investigate AI safety research questions in areas such as Adversarial Robustness, Dangerous Capability Evaluations, and Scalable Oversight.
https://alignment.anthropic.com/2024/anthropic-fellows-program/
Seonglae Cho