AISC 10th Application

What skills and experiences are you bringing?

Take your time answering this question, since it will be the most important question for many projects.

Focus on skills and experiences relevant for the tasks and topics you're interested in (see previous questions) and projects you're interested in (see project proposals).

Imagine you're a research lead reading the answer to this question, trying to figure out if this applicant has the skills needed for the project. What do you (the imaginary research lead) want to know?

Please include relevant links, e.g. GitHub (or other coding sample), things you've written, etc.

Feel free to link to a CV document, LinkedIn or similar

Answer

In the last year, there has been much excitement in the Mechanistic Interpretability community about using Sparse Autoencoders (SAEs) to extract monosemantic features. Yet for downstream applications the usage has been much more muted. In a wonderful paper Sparse Feature Circuits, Marks et al. do the only real application of SAEs to solving a useful problem to date (at the time of writing). Yet many of their circuits make significant use of the “error term” from the SAE (i.e. the part of the model’s behaviour that the SAE isn’t well capturing). This isn’t really the fault of Marks et al., it just seems like the underlying features were not effective enough.

We believe that the reason SAEs haven’t been as useful as the excitement suggests is because the SAEs simply aren’t yet good enough at extracting features. Combining ideas from new methods in SAEs with older approaches from the literature, we believe that it’s possible to significantly improve the performance of feature extraction in order to allow SAE-style approaches to be more effective.

We would like to make progress towards truly understanding features: how we ought to extract features, how features relate to each other and perhaps even what “features” are.

Required skills:

Everyone:

Understand what SAEs are (e.g. have read Towards Monosemanticity or my blog post). If you haven’t read either of these yet, go have a look! Just want to check you find the SAE problem interesting
Has taken >=1 course on Linear Algebra or has watched the 3b1b Linear Algebra videos
Has some understanding of how transformers work (e.g. at least one of: have watched the 3b1b Neural Network videos, have built a transformer from scratch, have read A Mathematical Framework For Transformer Circuits etc.)
Like giving clear feedback - you feel able to read a paper/blog post draft and say “hm i didn’t like that part because X”
Enjoy (or think you would enjoy) collaborating with others

Engineers:

Comfortable with PyTorch (or Jax) but don’t need to have done MechInterp research before
Has some code which is open source (or is not public but are willing to share). Doesn’t need to be anything fancy or even ML - a class project, some weekend curiosity notebook or anything works! Alternatively have worked as a software engineer
Strongly encouraged: Likes types in Python

Theorists/Distillators

Able to code-switch between formalism and informality
Desire to clarify conceptual confusion
Background in mathematics, statistical theory, physics or analytic philosophy

Diverse and interesting skills (nice to have and definitely apply if you have them but not necessary!):

Philosophical background

Interest in category theory, model theory or functional analysis

Strong in geometry, topology, information theory or information geometry

3 years of development experience as a software engineer, with a diverse skill set including Python, Node.js web development, vector databases, Docker, and Linux

MSc student at UCL in London; during the fourth year of my CS bachelor's degree I spent one year as an NLP research intern and published a first-author system demo paper at NAACL

Since last year, I have been following the Transformer Circuits Thread and LessWrong, slowly reading A Mathematical Framework for Transformer Circuits, Towards Monosemanticity, Toy Models of Superposition, and In-context Learning and Induction Heads through to the end and summarizing them in blog posts

In April this year, I connected the concept of compressed sensing to the superposition hypothesis and phase change, studying these concepts through mechanistic interpretability and presenting them as a story in https://seongland.medium.com/superposition-hypothesis-for-steering-llm-with-sparse-autoencoder-c07b74d23e96 and https://seongland.medium.com/reversing-transformer-to-understand-in-context-learning-with-phase-change-feature-dimensionality-13cbf8a2f984

The year before last, I developed the AI service https://mbti.texonom.com/, which shows strong web design ability as well as my skills from my time as a software engineer

In April this year, I ran several proofs of concept of activation engineering by applying PyTorch hooks to layers during inference: https://github.com/seonglae/treensformer/blob/main/test.py

In October this year, in https://github.com/seonglae/mistral-sae, I improved on someone else's open-source implementation of a Mistral SAE, adding automated interpretability features, and I am now writing code to apply it to Pixtral 12B as well

In https://github.com/seonglae/neuralland, I am building an SAE service similar to Neuronpedia that recommends features to activate based on the chat conversation

I keep a blog about ML, including mechanistic interpretability, at https://texonom.com/e40305fd878c46ca85d99ea93ee9a2ff

LinkedIn https://www.linkedin.com/in/seonglae/

GitHub https://github.com/seonglae

Experienced in distributed training of several models and comfortable handling large datasets, including optimizations such as dataset streaming and multiprocessing

CV: https://seongland.com/cv.pdf

I regard human intelligence and artificial intelligence as two different forms of intelligence, neither superior to the other, and think of both as statistical machines

I have a mathematical understanding of transformers and have implemented one myself using einsum and Kronecker products

I prefer collaboration built around Python typing and have varied experience with typing

I enjoy giving clear feedback and worked on several team projects during my research lab internship

When thinking about AI risk, what are you most concerned about and why?

You can also talk about current increasing harms of AI.

Large Language Models have enabled the generalization of numerous downstream tasks within a single model, but their high degree of freedom introduces risks. Given this generalized intelligence, my primary concern is not the autonomy of AI but the scope of its use, particularly AI control. For instance, if major AI services subtly insert advertisements into responses or apply implicit political indoctrination through activation engineering, it could have profound negative impacts on society. Paradoxically, these threats may arise from advances in mechanistic interpretability and activation engineering, which can be used to manipulate models in intended ways. I am particularly concerned about these misuse cases and believe that research focused on safe AI control is crucial to mitigate such risks.

What ways to reduce AI risk do you think are more or less promising? Why do you think so?

I adhere to the Natural Abstraction Hypothesis, which suggests that AI abstracts the real world in ways similar to humans. Given this, I am skeptical that adjusting AI incentives can effectively prevent Instrumental Convergence. Instead, I believe that using mechanistic interpretability to analyze the incentives induced by training transformer models, in order to minimize phenomena such as the Waluigi Effect, is a more promising approach. Practically, activation engineering using Sparse Autoencoders has shown potential for understanding and controlling models with steering vectors. While this method can be commercially viable, it does not fundamentally eliminate harmful capabilities within the models. Therefore, I advocate for increasing interpretability during the training phase to induce incentives in safer directions. This proactive approach during training may offer a more robust solution to AI risks than post-hoc control mechanisms.

What other time commitments do you have during Jan - Apr 2025, and how much time will be taken up by these commitments?

As a full-time Master's student at UCL, I will be in my second term, with classes from Tuesday to Friday.

How much time will you have for AISC?

Include how many hours per week you can commit to AISC for the duration of the program (Jan - April).

I can commit approximately 15 to 20 hours per week to the AISC program. I am available full-time on Mondays, Saturdays, and Sundays. However, realistically, I anticipate that some weekends may have other events. I have classes from Tuesday to Friday, but I can also allocate additional hours in the evenings during weekdays. I am willing to dedicate extra time as needed to contribute effectively to the program.

What skills and experiences are you bringing?

Take your time answering this question, since it will be the most important question for many projects.

Focus on skills and experiences relevant for the tasks and topics you're interested in (see previous questions) and projects you're interested in (see project proposals).

Imagine you're a research lead reading the answer to this question, trying to figure out if this applicant has the skills needed for the project. What do you (the imaginary research lead) want to know?

Please include relevant links, e.g. GitHub (or other coding sample), things you've written, etc.

Feel free to link to a CV document, LinkedIn or similar

Use bullet points

Experience

I have 3 years of development experience as a software engineer with a diverse skill set including Python, Node.js web development, vector databases, Docker, and Linux.

I am an AI MSc student at UCL in London with a Bachelor's degree in Computer Science. During my senior year I spent one year as an NLP research intern and published a first-author system demo paper at NAACL 2024.

I used concepts from compressed sensing to introduce the superposition hypothesis and phase change in mechanistic interpretability, framing them around an AI safety scenario, in this blog post: https://seongland.medium.com/superposition-hypothesis-for-steering-llm-with-sparse-autoencoder-c07b74d23e96

Since last year, I have been following the Transformer Circuits Thread and LessWrong, thoroughly reading papers such as "A Mathematical Framework for Transformer Circuits", "Towards Monosemanticity", "Toy Models of Superposition", and "In-context Learning and Induction Heads", and I organized my summaries into this blog post: https://seongland.medium.com/reversing-transformer-to-understand-in-context-learning-with-phase-change-feature-dimensionality-13cbf8a2f984

Skills

One year ago, I developed an AI service https://mbti.texonom.com/, demonstrating my web design and implementation abilities.

In April this year, I built proofs of concept for activation engineering that apply PyTorch hooks to layers during inference: https://github.com/seonglae/treensformer/blob/main/test.py
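
As a rough illustration of this kind of hook-based intervention, here is a minimal sketch. It is not code from the linked repository: the toy model, `steering_vector`, and `scale` are assumptions made up for this example, and a real experiment would hook a transformer layer instead.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a stack of model layers (hypothetical, for illustration only).
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))

# Hypothetical direction to add to the hooked layer's output, and its strength.
steering_vector = torch.ones(8)
scale = 2.0


def add_steering(module, inputs, output):
    # A forward hook that returns a value replaces the module's output.
    return output + scale * steering_vector


x = torch.randn(2, 8)
with torch.no_grad():
    layer_plain = model[0](x)      # first-layer activations without the hook
    baseline = model(x)            # full forward pass without the hook

    handle = model[0].register_forward_hook(add_steering)
    layer_steered = model[0](x)    # same activations, shifted by the hook
    steered = model(x)             # downstream output under steering
    handle.remove()                # always detach the hook afterwards

    restored = model(x)            # behaviour returns to the baseline
```

Removing the handle matters in practice: a hook that stays registered silently changes every later forward pass.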

In October this year, I forked an open-source implementation of Mistral SAE at https://github.com/seonglae/mistral-sae. I added automated interpretability features and am currently writing code to apply it to Pixtral 12B using TransformerLens and SAELens.

I am developing a service similar to Neuronpedia at https://github.com/seonglae/neuralland: an SAE service that recommends features to activate based on chat conversations (currently being refactored to use NNsight).

I am experienced in optimizing heavy workloads such as distributed training of LLaMA, and I am familiar with handling large-scale datasets using techniques such as dataset streaming and multiprocessing.

I have a mathematical understanding of transformers and have implemented them in several ways, using both einsum and Kronecker products.
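
To make the einsum and Kronecker-product views concrete, here is a minimal NumPy sketch of a single causal attention head. It is an illustration written for this answer, not code from my repositories; the shapes, weight names, and random weights are assumptions. It checks that the einsum form A x W_OV^T matches the Kronecker form (A ⊗ W_OV) applied to the flattened residual stream.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_head(x, W_Q, W_K, W_V, W_O):
    """x: [pos, d_model]; W_Q, W_K, W_V: [d_head, d_model]; W_O: [d_model, d_head]."""
    q = np.einsum("hd,pd->ph", W_Q, x)
    k = np.einsum("hd,pd->ph", W_K, x)
    scores = np.einsum("qh,kh->qk", q, k) / np.sqrt(W_Q.shape[0])
    # Causal mask: a query position may not attend to later key positions.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    A = softmax(np.where(future, -np.inf, scores), axis=-1)
    W_OV = W_O @ W_V  # combined output-value matrix, [d_model, d_model]
    out = np.einsum("qk,kd,ed->qe", A, x, W_OV)  # A @ x @ W_OV.T
    return A, W_OV, out


rng = np.random.default_rng(0)
n_pos, d_model, d_head = 4, 6, 3
x = rng.normal(size=(n_pos, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_head, d_model)) for _ in range(3))
W_O = rng.normal(size=(d_model, d_head))

A, W_OV, out = attention_head(x, W_Q, W_K, W_V, W_O)

# Kronecker view: with row-major flattening, vec(A x W_OV^T) = (A kron W_OV) vec(x).
out_kron = (np.kron(A, W_OV) @ x.reshape(-1)).reshape(n_pos, d_model)
```

The Kronecker form is far too large to use for real models, but it makes explicit that the attention pattern A mixes positions while W_OV acts on the residual dimension.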

Philosophy

I consider human intelligence and artificial intelligence as two different forms of intelligence without any superiority, and I think of both as statistical machines.

I organize all of my technical writing and concepts about ML and mechanistic interpretability as a Zettelkasten at https://texonom.com/e40305fd878c46ca85d99ea93ee9a2ff.

I prefer collaboration based on Python typing and have experience with various typing practices.

I have experience with multiple team projects during my research internship and enjoy providing clear feedback.

Links

CV: https://seongland.com/cv.pdf

LinkedIn: https://www.linkedin.com/in/seonglae/

GitHub: https://github.com/seonglae

(14) Why do you want to join this project?

SAEs are undoubtedly an amazing discovery, but through some proofs of concept, I have encountered limitations in using SAEs practically. I want to demonstrate that mechanistic interpretability can be practical and contribute meaningfully to AI control, especially AI safety. Having firsthand experience with the current limitations of SAEs, I am motivated to improve them. The MDL SAE paper provided valuable insights, highlighting how the overcomplete basis of SAEs leads to multiple interpretations and emphasizing the importance of managing sparsity. According to the Natural Abstraction Hypothesis, both humans and LLMs are optimized to share dimensions through Compressed Sensing for efficiency. I agree with the approach of solving these issues by effectively utilizing sparsity. Leveraging my engineering skills, I aim to conduct extensive experiments to observe how SAEs operate in this context. I believe that my practical experience can contribute to the project.

(14) If the project fails, what do you think would be the most likely reason?

First, if the primary issue with SAEs does not stem from the overcomplete basis creating multiple interpretations, then improving sparsity allocation might not resolve the key challenges. Second, structural limitations within SAEs might prevent the extracted features from developing sufficiently hierarchical feature geometry. Finally, the real-world phenomena modeled by SAEs might lack a clear hierarchical structure, which would limit the effectiveness of our approaches.

(15) Why do you want to join this project? What about this proposal resonated the most with you?

I want to contribute to the paradigm shift from big data-driven science to hypothesis-driven science. Your emphasis on tackling foundational issues and moving beyond current hype cycles resonates with me. Also, I am particularly impressed by how weight interpretability can be derived from first principles in your previous project. I am excited about the opportunity to critically analyze assumptions that are taken for granted in SAEs, examining them one by one to contribute to a more rigorous approach.

(15) (Optional) Collaboration opportunity before AISC

I have extensive implementation experience in RL methods, ranging from policy-gradient and actor-critic algorithms such as PPO, through value-based methods such as DQN, to offline RL such as CQL, along with an understanding of model-based RL and reward modeling. I am keen to contribute to the project.
