AISC 10th Application

Created: 2024 Nov 18 13:40
Creator: Seonglae Cho
Edited: 2024 Nov 18 22:34

What skills and experiences are you bringing?

  • Take your time answering this question, since it will be the most important question for many projects.
  • Focus on skills and experiences relevant for the tasks and topics you're interested in (see previous questions) and projects you're interested in (see project proposals).
  • Imagine you're a research lead reading the answer to this question, trying to figure out if this applicant has the skills needed for the project. What do you (the imaginary research lead) want to know?
  • Please include relevant links, e.g. GitHub (or other coding sample), things you've written, etc.
  • Feel free to link to a CV document, LinkedIn or similar
Answer
In the last year, there has been much excitement in the Mechanistic Interpretability community about using Sparse Autoencoders (SAEs) to extract monosemantic features. Yet for downstream applications the usage has been much more muted. In a wonderful paper Sparse Feature Circuits, Marks et al. do the only real application of SAEs to solving a useful problem to date (at the time of writing). Yet many of their circuits make significant use of the “error term” from the SAE (i.e. the part of the model’s behaviour that the SAE isn’t well capturing). This isn’t really the fault of Marks et al., it just seems like the underlying features were not effective enough.
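To make the "error term" concrete, here is a minimal NumPy sketch of an SAE forward pass. The weights are random and untrained, and the names (`sae_forward`, `W_enc`, `W_dec`) and sizes are illustrative assumptions rather than anything taken from Marks et al. The only point is that an activation splits exactly into the SAE reconstruction plus a residual error.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (features, reconstruction, error) for activations x of shape (batch, d_model)."""
    # ReLU encoder: non-negative, ideally sparse feature activations.
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Linear decoder: reconstruction as a weighted sum of feature directions.
    reconstruction = features @ W_dec + b_dec
    # Error term: whatever part of x the SAE fails to capture.
    error = x - reconstruction
    return features, reconstruction, error

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # hypothetical sizes; d_sae > d_model gives an overcomplete basis
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))
features, reconstruction, error = sae_forward(x, W_enc, b_enc, W_dec, b_dec)

# Fraction of the activation's squared norm left unexplained by the SAE.
unexplained = (error ** 2).sum() / (x ** 2).sum()
```

A circuit that leans heavily on `error` is relying on the part of the model that the SAE features do not describe, which is the concern raised above.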
We believe that the reason SAEs haven’t been as useful as the excitement suggests is because the SAEs simply aren’t yet good enough at extracting features. Combining ideas from new methods in SAEs with older approaches from the literature, we believe that it’s possible to significantly improve the performance of feature extraction in order to allow SAE-style approaches to be more effective.
We would like to make progress towards truly understanding features: how we ought to extract features, how features relate to each other and perhaps even what “features” are.
Required skills:
  • Engineers:
    • Comfortable with PyTorch (or Jax) but don’t need to have done MechInterp research before
    • Has some code that is open source (or not public, but that you are willing to share). Doesn’t need to be anything fancy or even ML - a class project, some weekend curiosity notebook or anything works! Alternatively, has worked as a software engineer
    • Strongly encouraged: Likes types in Python
  • Theorists/Distillators
    • Able to code-switch between formalism and informality
    • Desire to clarify conceptual confusion
    • Background in mathematics, statistical theory, physics or analytic philosophy
Diverse and interesting skills (nice to have and definitely apply if you have them but not necessary!):
  • Philosophical background
  • Interest in category theory, model theory or functional analysis
  • Strong in geometry, topology, information theory or information geometry
 
 
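  • 3 years of development experience as a software engineer, with a varied skill set including Python, Node web development, vector databases, Docker, and Linux
  • MSc student at UCL in London; one year of NLP research internship during the fourth year of my CS bachelor's, and a first-author system demo paper published at NAACL
  • Since last year I have followed the Circuits Thread and LessWrong, reading A Mathematical Framework for Transformer Circuits, Towards Monosemanticity, Toy Models of Superposition, and In-context Learning and Induction Heads slowly and all the way through, and summarizing them in blog posts
  • The year before last I developed the AI service https://mbti.texonom.com/, which also shows my web design ability and my skills from my time as a software engineer
  • In April this year, I ran various PoCs of activation engineering by attaching PyTorch hooks to layers during inference
    test.py
    seonglae
  • GitHub: https://github.com/seonglae
  • Experience with distributed training of several models, and familiar with handling large datasets, including optimizations such as dataset streaming and multiprocessing
  • CV: https://seongland.com/cv.pdf
  • I regard human intelligence and artificial intelligence as two different forms of intelligence, neither superior to the other, and I think of both as statistical machines
  • I have a mathematical understanding of transformers and have implemented them directly with einsum and Kronecker products
  • I prefer collaboration built on Python typing and have varied typing experience
  • I enjoy giving clear feedback and worked on several team projects during my lab internship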
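For readers unfamiliar with hook-based activation engineering, the idea can be sketched as follows. This is a minimal illustration on a toy `nn.Sequential` model, not the actual PoC code; the steering vector, its scale, and the choice of hooked layer are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a model; a real experiment would hook a transformer block instead.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
model.eval()

# Hypothetical steering direction; in practice it might come from contrasting
# activations on two prompts or from an SAE decoder row.
steering_vector = torch.randn(16)
scale = 2.0

def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + scale * steering_vector

x = torch.randn(3, 16)
with torch.no_grad():
    baseline = model(x)
    handle = model[0].register_forward_hook(add_steering)
    try:
        steered = model(x)
    finally:
        handle.remove()  # always detach the hook so later calls are unaffected
    restored = model(x)
```

Because the hook is removed after the steered pass, `restored` matches `baseline`, which makes it easy to compare steered and unsteered behaviour on the same input.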
 
 
 

When thinking about AI risk, what are you most concerned about and why?

You can also talk about current increasing harms of AI.
Large Language Models have enabled a single model to generalize across numerous downstream tasks, but their high degree of freedom introduces risks. Given this generalized intelligence, my primary concern is not the autonomy of AI but the scope of its use, particularly AI control. For instance, if major AI services subtly insert advertisements into responses or apply implicit political indoctrination through activation engineering, it could have profound negative impacts on society. Paradoxically, these threats may arise from advancements in mechanistic interpretability and activation engineering, which can be used to manipulate models in intended ways. I am particularly concerned about these misuse cases and believe that research focused on safe AI control is crucial to mitigate such risks.

What ways to reduce AI risk do you think are more or less promising? Why do you think so?

I adhere to the Natural Abstraction Hypothesis, which suggests that AI abstracts the real world in ways similar to humans. Given this, I am skeptical that adjusting AI incentives can effectively prevent Instrumental Convergence. Instead, I believe that using mechanistic interpretability to analyze the incentives induced by training transformer models, and so to minimize phenomena such as the Waluigi Effect, is a more promising approach. Practically, activation engineering using Sparse Autoencoders has shown potential for understanding and controlling models with Steering Vectors. While this method can be commercially viable, it does not fundamentally eliminate harmful capabilities within the models. Therefore, I advocate for increasing interpretability during the training phase to induce incentives in safer directions. This proactive approach during training may offer a more robust solution to AI risks than post-hoc control mechanisms.

What other time commitments do you have during Jan - Apr 2025, and how much time will be taken up by these commitments?

As a full-time Master's student at UCL, I will be in my second term, with classes from Tuesday to Friday.

How much time will you have for AISC?

Include how many hours per week you can commit to AISC for the duration of the program (Jan - April).
 
 
I can commit approximately 15 to 20 hours per week to the AISC program. I am available full-time on Mondays, Saturdays, and Sundays, although realistically I expect to lose about half of the weekends to other events. I have classes from Tuesday to Friday, but I can also allocate additional hours on weekday evenings. I am willing to dedicate extra time as needed to contribute effectively to the program.

What skills and experiences are you bringing?

  • Take your time answering this question, since it will be the most important question for many projects.
  • Focus on skills and experiences relevant for the tasks and topics you're interested in (see previous questions) and projects you're interested in (see project proposals).
  • Imagine you're a research lead reading the answer to this question, trying to figure out if this applicant has the skills needed for the project. What do you (the imaginary research lead) want to know?
  • Please include relevant links, e.g. GitHub (or other coding sample), things you've written, etc.
  • Feel free to link to a CV document, LinkedIn or similar
  • Use bullet points

Experience

  • I have 3 years of development experience as a software engineer with a diverse skill set including Python, Node.js web development, vector databases, Docker, and Linux.
  • As a UCL AI MSc student in London with a Bachelor's degree in Computer Science, I have one year of NLP research internship experience during my senior year and published a first-author system demo paper at NAACL 2024.

Skills

  • In October this year, I forked an open-source implementation of Mistral SAE at https://github.com/seonglae/mistral-sae and added automated interpretability features; I am currently writing code to apply it to Pixtral 12B using TransformerLens and SAELens.
  • I am developing a service similar to Neuronpedia at https://github.com/seonglae/neuralland: an SAE service that recommends features to activate based on chat conversations (currently being refactored to use NNsight).
  • I am experienced in optimizing heavy jobs such as distributed training of LLaMA, and I am familiar with handling large-scale datasets using techniques such as dataset streaming and multiprocessing.
  • I have a mathematical understanding of transformers and have implemented them in several ways, using both einsum and Kronecker products.
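As an illustration of how the einsum and Kronecker-product views of attention relate, here is a toy single-head sketch in NumPy. It is not the implementation referred to above: the weights are random, and all names and sizes are assumptions. It checks that mixing tokens with the attention pattern and mixing features with the OV matrix is the same linear map as one Kronecker-product matrix applied to the flattened input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4  # hypothetical toy sizes

x = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# Attention pattern A via einsum, with a causal mask and row-wise softmax.
q = np.einsum("nd,dh->nh", x, W_Q)
k = np.einsum("nd,dh->nh", x, W_K)
scores = np.einsum("qh,kh->qk", q, k) / np.sqrt(d_head)
mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
scores = np.where(mask, scores, -np.inf)
scores = scores - scores.max(axis=-1, keepdims=True)
A = np.exp(scores)
A = A / A.sum(axis=-1, keepdims=True)

# Head output via einsum: mix tokens with A, mix features with W_V W_O.
W_OV = W_V @ W_O
head_einsum = np.einsum("qk,kd,de->qe", A, x, W_OV)

# Same head as one Kronecker-product matrix acting on the flattened input.
# For row-major flattening, vec(A X W) = (A kron W^T) vec(X).
head_kron = (np.kron(A, W_OV.T) @ x.reshape(-1)).reshape(n_tokens, d_model)
```

The Kronecker form is far less efficient than einsum, but it makes explicit that, once the attention pattern is fixed, the head is a single linear map over token positions and feature dimensions.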

Philosophy

  • I consider human intelligence and artificial intelligence as two different forms of intelligence without any superiority, and I think of both as statistical machines.
  • I prefer collaboration based on Python typing and have experience with various typing practices.
  • I have experience with multiple team projects during my research internship and enjoy providing clear feedback.

Links

(14) Why do you want to join this project?

SAEs are undoubtedly an amazing discovery, but through several proofs of concept I have run into their limitations in practical use. I want to demonstrate that mechanistic interpretability can be practical and contribute meaningfully to AI control, and to AI safety in particular. Having firsthand experience with the current limitations of SAEs, I am motivated to improve them. The MDL SAE paper provided valuable insights, highlighting how the overcomplete basis of SAEs leads to multiple interpretations and emphasizing the importance of managing sparsity. According to the Natural Abstraction Hypothesis, both humans and LLMs are optimized to share dimensions through Compressed Sensing for efficiency. I agree with the approach of solving these issues by effectively utilizing sparsity. Leveraging my engineering skills, I aim to conduct extensive experiments to observe how SAEs operate in this context. I believe that my practical experience can contribute to the project.

(14) If the project fails, what do you think would be the most likely reason?

First, if the primary issue with SAEs does not stem from the overcomplete basis creating multiple interpretations, then improving sparsity allocation might not resolve key challenges. Second, structural limitations within SAEs might hinder the extracted features from developing sufficiently hierarchical feature geometry. Finally, the real-world phenomena modeled by SAEs might lack a clear hierarchical structure, complicating the effectiveness of our approaches.

(15) Why do you want to join this project? What about this proposal resonated the most with you?

I want to contribute to the paradigm shift from big data-driven science to hypothesis-driven science. Your emphasis on tackling foundational issues and moving beyond current hype cycles resonates with me. Also, I am particularly impressed by how weight interpretability can be derived from first principles in your previous project. I am excited about the opportunity to critically analyze assumptions that are taken for granted in SAEs, examining them one by one to contribute to a more rigorous approach.

(15) (Optional) Collaboration opportunity before AISC

I have extensive implementation experience with RL methods such as PPO, DQN, and CQL, ranging from policy gradients and actor-critic algorithms to offline RL, along with an understanding of model-based RL and reward modeling. I am keen to contribute to the project.
 
 
 
 