IRG Application

Impact Research Groups

Why do you think AI Safety is an important area to focus on? Is there anything that you could learn that would make you think it was less important? (~200 words)

The prospect of an intelligence explosion toward AGI is a long-term risk, and the proliferation of disinformation produced by generative AI is an immediate threat to democracy. AI biases can recursively amplify societal conflicts, while AI-driven robots and drones, which have the potential to be weaponized, could lead to catastrophic consequences. These challenges are not speculative scenarios but pressing realities, and the accelerating pace of AI development starkly contrasts with our inadequate preparation.

One of the gravest concerns is the tendency of increasingly intelligent AI systems to pursue control as a universal subgoal, since gaining control is instrumental in achieving almost any overarching objective. This pursuit is likely to conflict with human interests. Addressing this requires a deep understanding of AI decision-making processes and behaviors. Mechanistic Interpretability plays a crucial role in this effort by enabling precise analysis and prediction of AI behavior, helping to mitigate risks.

If Mechanistic Interpretability gave us explicit control over and comprehension of AI systems, so that their behavior could be reliably predicted, I would consider AI safety less pressing, and resources could be reallocated to other global challenges such as climate change. While uncertainties remain, focusing on interpretability offers a viable path to reduce risks.

What makes you a good candidate for this particular stream? (~300 words)

I bring a combination of technical expertise, project experience, and a passion for Mechanistic Interpretability to this stream. My hands-on work with Sparse Autoencoders (SAEs) demonstrates my ability to design and execute research with real-world applications. For instance, I won the 2024 Holistic AI Research Hackathon by showcasing the potential of SAEs for debiasing large language models, translating theory into practice.

In addition, I have extensive experience with interpretability programming. I have worked with tools like transformer_lens and developed pipelines for manipulating residual streams. During this process, I directly examined activation distributions at both the token level and SAE feature index level, gaining valuable insights. Building on this foundation, I am pursuing research on multiple-SAE feature steering as part of my MSc thesis.

My theoretical background in Mechanistic Interpretability is equally robust. I have authored blog posts exploring concepts such as the Superposition Hypothesis and Phase Changes, highlighting my ability to synthesize ideas across technical domains and communicate them clearly.

What is your current plan for improving the world, and how could taking part in IRG help you achieve this? (~300 words)

My goal is to advance Mechanistic Interpretability and alignment methods to ensure that powerful AI systems remain safe and beneficial. In my MSc program, I am developing tools to align neural networks with human values, focusing on leveraging Sparse Autoencoders to unlearn harmful or biased features in AI systems.

Participating in IRG offers an unparalleled opportunity to collaborate with leading researchers and refine these goals. The program’s mentorship would enable me to deepen both my theoretical understanding and practical expertise in alignment research. I am particularly interested in exploring scalable oversight methods and applying interpretability techniques to optimize models at scale. Additionally, IRG would enhance my ability to design and execute impactful research projects, such as my MSc thesis on steering LLM behavior through feature manipulation.

How familiar are you with Effective Altruism?

I am moderately familiar with Effective Altruism (EA), which focuses on maximizing positive impact through evidence and reason, grounded in utilitarian principles. I have explored related ideas like Longtermism, which prioritizes ensuring a positive future for humanity.

I understand how EA connects with ethical frameworks like Moral Ambition, and how it relates to AI development and Effective Accelerationism (e/acc). While not an active member of the EA community, I strongly resonate with its principles, particularly in relation to AI safety and alignment research.

Impact Research Groups

Impact Research Groups (IRG) is designed to support talented and ambitious students in London who wish to pursue high-impact research careers. Participants will work in small groups with experienced mentors to explore a research question focused on one of our streams. After 8 weeks, the projects will be shared with judges and a winning project will be selected to receive a £2000 prize.

https://www.impactresearchgroups.org/