UST D3CODE 2024 Ideagen

It is mandatory for all solutions to be developed as open source.

Scalable

Team Name Neuralland Drive

Who are the Team members? Seonglae Cho - Full Stack AI developer majoring in AI for Sustainable Development at UCL

What is your idea?

Problem statement - AI is getting smarter as it scales, and there is an increasing need to control AI due to safety concerns, such as preventing unintended or harmful behavior

Proposed solution - Activation Engineering with Steering Vector based on LLM feature extraction

Expected impact and benefits - Build an Activation Engineering UI that provides a new way to control AI explicitly and precisely

Resource requirements - GPU and Web server to run prototype

How is your idea related to the theme given?

Building ethical frameworks that scale with AI Governance

Explainability - Mechanistic Interpretability is the most promising field of Explainable AI

Safety - Steering and controlling AI can make it safer, especially since AGI could become a threat, like a new species on Earth

Transparency - Switching a steering vector on or off, or adding one, explicitly changes the output, and the change is grounded in mathematical theory

Reproducibility - Unlike Prompt Engineering, which is more subject to randomness because it relies on specific natural-language wording, Activation Engineering gives more reproducible results because it uses features extracted from the neural network, specifically with Sparse AutoEncoders

Robustness - Jailbreaking is done through Prompt Engineering, because one prompt can be prioritized over another. Activation Engineering overrides input prompts and is ultimately applied by changing the internal state of the LLM, so it is more robust against malicious input.

Do you know if people need what you're making? (If you believe they do, please explain in as much detail how you know.)

Not just individual people: society as a whole requires controllability of AI, since today we cannot reliably steer AI as intended even in a simple chat. AI sometimes does something very different from what I requested, or does not understand my query. This behavior is becoming more and more dangerous, since large models tend to hallucinate more implicitly as they know more. When they become smart enough, they could be a threat to humans or to social stability. Getting used to this interface for controlling AI, and to the concept of Mechanistic Interpretability, would definitely help not only developers who seek to achieve AI Alignment but also the public, who will communicate with AI more and more in the future.

Does the product or solution you are building have any competitors? If yes, please identify the names of your competitors. If not, identify one product that may be closest in character to your product or solution. (Check the web, do some quick research before answering this question.)

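OpenAI and Anthropic provide neuron analysis, but only as pages that accompany their papers. Neuronpedia, which collaborates with DeepMind, provides neuron analysis for a variety of models, but it is meant for expert analysis or research. What sets my idea apart is an interface, simpler than an ordinary conversation, through which anyone can easily control AI via its neurons.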

What technologies are you considering in developing your product? (Identify the primary programming languages and all other programming languages, database software, connectivity software, software tools, frameworks etc., that you will be using. We understand you may not have a complete list just yet. We just want to make sure you plan to adhere to a free and open-source software stack.)

To achieve the best results in a short hackathon, open-source or open-weight AI models will be used for the prototype. Gemma Scope, developed by Google DeepMind, is the main resource for analyzing Gemma LLMs. I will use a GPU Python server with FastAPI and a front-end server built with Next.js and React. TorchServe would be used as the model inference server, and ONNX Runtime and Faiss will also be used if needed.

Describe the architecture of your solution? (Provide conceptual, logical, and component views at a minimum)

The prototype’s main purpose is to compare Activation Engineered output with Prompt Engineering-only output.

The prototype will provide an intuitive interface that makes activation engineering understandable and offers precise control. The interface will also show the difference in responses between Prompt Engineering and Activation Engineering.

It will offer a set of selectable activation candidates that clearly demonstrate the effect of manipulating activations.
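The side-by-side comparison described above could be sketched as follows. This is only an illustrative sketch: the `Comparison` record, the `compare_side_by_side` helper, and the two stand-in generators are hypothetical names, and a real backend would call the LLM where the stand-ins return canned strings.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass
class Comparison:
    prompt: str
    prompt_only: str   # response produced with Prompt Engineering only
    steered: str       # response produced with Activation Engineering applied
    similarity: float  # 1.0 means the two responses are identical


def compare_side_by_side(
    prompt: str,
    generate_prompt_only: Callable[[str], str],
    generate_steered: Callable[[str], str],
) -> Comparison:
    """Run both generation paths on the same prompt and pair the results."""
    plain = generate_prompt_only(prompt)
    steered = generate_steered(prompt)
    similarity = SequenceMatcher(None, plain, steered).ratio()
    return Comparison(prompt, plain, steered, similarity)


# Stand-in generators (hypothetical); a real backend would call the LLM here.
def fake_prompt_only(prompt: str) -> str:
    return f"Answer to: {prompt}"


def fake_steered(prompt: str) -> str:
    return f"Short answer to: {prompt}"


result = compare_side_by_side("What is an SAE?", fake_prompt_only, fake_steered)
print(result.prompt_only)
print(result.steered)
print(f"similarity = {result.similarity:.2f}")
```

The similarity score is just one cheap way to flag how far the two responses diverge; the UI could equally show a word-level diff.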

You can have the last word. (Anything that you think we have not asked you but is important for us to know and account for when reviewing and judging your idea?)

But why do we really need this? AI is getting smarter, and therefore, there is a need to control AI due to safety concerns. In other words, AI safety matters because AI might become a threat to humanity. That is why controlling AI is essential. Activation engineering is a highly promising candidate that could provide comprehensive interpretability of the AI model’s internal "black box." I am very excited about my future research on this topic. Finally, stay safe and healthy before the invasion of the AI army starts!

Slides Plan

Title, Team Name

Team Members & Roles

Problem statement

AI Safety

Proposed solution

Activation Engineering
Example
Explainability
Safety
Transparency
Reproducibility
Prototype

Expected impact and benefits

questions

Resource requirements

last word

Evaluation Criteria

Below is the evaluation mechanism being used for this event. Please prepare your presentations accordingly.

Evaluation Criteria for Ideas:

Innovation and Creativity – 40 Points

How original and innovative is the idea?
Does it offer a novel solution to the problem?

Impact – 30 Points

What is the potential social impact of the idea?
How many people will benefit from the solution?

Feasibility – 10 Points

Is the idea practical and achievable within the given resources and timeframe?
Are the necessary resources and skills easily available?

Scalability – 10 Points

Can the idea be scaled or replicated in other communities?

Presentation – 10 Points

How well is the idea articulated and presented?

Title

Description

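Completely change the experiment in the second paragraph, lengthen the third paragraph appropriately to fit the current situation, and change the title to match.

Use headings 1-6 to make it much more structured and easier to read; bullet-style points and flowcharts can be added, but do not force them.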

Steering AI based on Activation Engineering to improve reasoning ability

Artificial Intelligence (AI) is getting smarter, and thus there is an increasing need to control AI due to safety concerns, such as preventing unintended or harmful behavior. Prompt Engineering is a common way to manipulate AI by altering its responses through prompts. For example, in Prompt Engineering, we might guide the AI by adding a prompt like "Please summarize the following text", which influences its response through the input. In contrast, Activation Engineering directly adjusts the internal state of a "summarization" feature within the AI's neural network to achieve the desired outcome without modifying the prompt. Recent studies aim to control AI without relying on prompting by extracting features of Large Language Models (LLMs) such as GPT-4 (Gao et al., 2024) and Claude Sonnet (Templeton et al., 2024). For instance, adding a Steering Vector to the neural network alters the responses of LLMs (Konen et al., 2024). These attempts to control AI by manually changing internal states are referred to as "Activation Engineering" (Turner et al., 2024).

However, prior studies have primarily focused on vector extraction, without fully demonstrating practical performance improvement with Steering Vectors. My research aims to address this gap by using Activation Engineering to enhance LLM performance across different benchmarks, particularly in reasoning and question-answering tasks. This research will follow two steps: a) extracting and decomposing vectors from a Large Language Model, and b) comparing the performance of Steering Vectors in Activation Engineering with the traditional Prompt Engineering method. These experiments are planned to be conducted utilizing Gemma Scope developed by Lieberum et al. (2024) and will use a method to find feature vectors automatically based on GPT-4 (Bills et al., 2023). After extraction, we will manually select candidate vectors and map them to benchmarks, aiming for the highest performance metrics. The vector-manipulated LLMs are anticipated to achieve higher performance metrics overall and scores will be analyzed across tasks. Also, finding the optimal combination of vectors will be another experiment, ensuring that activating multiple features does not compromise model capabilities (Turner et al., 2023).
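Step b) above, comparing Steering Vectors against Prompt Engineering on a benchmark, could be organized roughly as in the sketch below. Everything here is an assumption for illustration: the exact-match metric, the `accuracy` and `compare_methods` helpers, the toy questions, and the two hard-coded stand-in models. The printed numbers therefore carry no empirical meaning.

```python
from typing import Callable, Dict, List, Tuple

# (question, expected answer) pairs; a real run would load a benchmark instead.
Benchmark = List[Tuple[str, str]]


def accuracy(answer_fn: Callable[[str], str], benchmark: Benchmark) -> float:
    """Fraction of questions whose answer exactly matches the reference."""
    if not benchmark:
        raise ValueError("benchmark is empty")
    correct = sum(
        1
        for question, expected in benchmark
        if answer_fn(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(benchmark)


def compare_methods(
    methods: Dict[str, Callable[[str], str]], benchmark: Benchmark
) -> Dict[str, float]:
    """Score every method on the same benchmark so results are comparable."""
    return {name: accuracy(fn, benchmark) for name, fn in methods.items()}


# Toy data and hard-coded stand-ins (hypothetical, not real model outputs).
toy_benchmark: Benchmark = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]


def prompt_baseline(question: str) -> str:
    return "4" if "2 + 2" in question else "London"


def steered_model(question: str) -> str:
    return "4" if "2 + 2" in question else "paris"


scores = compare_methods(
    {"prompt": prompt_baseline, "steered": steered_model}, toy_benchmark
)
print(scores)
```

Scoring both methods through the same function keeps the comparison fair: only the answer function changes between runs.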

Compared to Prompt Engineering, which asks the AI itself to change its behavior and therefore influences it only indirectly, Activation Engineering shows greater potential to steer AI by directly manipulating internal feature activations. This research will showcase how controlling AI through understanding its internal mechanisms can lead to improved performance.

Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., et al. (2023). ‘Language models can explain neurons in language models’. OpenAI. Available at: https://openai.com/index/language-models-can-explain-neurons-in-language-models/

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., et al. (2021). ‘A Mathematical Framework for Transformer Circuits’. Transformer Circuits Thread. Available at: https://transformer-circuits.pub/2021/framework/index.html.

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., et al. (2024). ‘Scaling and evaluating sparse autoencoders’. Arxiv. Available at: https://arxiv.org/abs/2406.04093.

Konen, K., Jentzsch, S., Diallo, D., Schütt, P., Bensch, O., El Baff, R., Opitz, D., et al. (2024). ‘Style Vectors for Steering Generative Large Language Models’. in Graham, Y. and Purver, M. (eds) Findings of the Association for Computational Linguistics: EACL 2024. St. Julian’s, Malta: Association for Computational Linguistics, pp. 782–802. Available at: https://aclanthology.org/2024.findings-eacl.52.

Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., et al. (2024). ‘Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2’. Arxiv. Available at: https://arxiv.org/abs/2408.05147.

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., et al. (2024). ‘Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet’. Transformer Circuits Thread. Available at: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.

Turner, A., M., M., Udell, D., Thiergart, L. and Mini, U. (2023). ‘Steering GPT-2-XL by adding an activation vector’. AI Alignment Forum. Available at: https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U. and MacDiarmid, M. (2024). ‘Activation Addition: Steering Language Models Without Optimization’. Arxiv. Available at: https://arxiv.org/abs/2308.10248.

0. Introduction

Artificial Intelligence (AI) is getting smarter, and thus there is an increasing need to control AI due to safety concerns, such as preventing unintended or harmful behavior. Prompt Engineering is a common way to manipulate AI by altering its responses through prompts. For example, in Prompt Engineering, we might guide the AI by adding a prompt like "Please summarize the following text", which influences its response through the input. In contrast, Activation Engineering directly adjusts the internal state of a "summarization" feature within the AI's neural network to achieve the desired outcome without modifying the prompt. Recent studies aim to control AI without relying on prompting by extracting features of Large Language Models (LLMs) such as GPT-4 (Gao et al., 2024) and Claude Sonnet (Templeton et al., 2024). For instance, adding a Steering Vector to the neural network alters the responses of LLMs (Konen et al., 2024). These attempts to control AI by manually changing internal states are referred to as "Activation Engineering" (Turner et al., 2024).

Activation Engineering is grounded in Mechanistic Interpretability (Elhage et al., 2021) and can be applied to any Transformer model that adheres to Scaling Laws (Kaplan et al., 2020). This makes it a highly scalable approach for controlling AGI, allowing for precise control over increasingly powerful AI systems.

1. Objective

The goal of this project is to develop an innovative control mechanism for AI models by employing Activation Engineering. This technique manipulates internal activations of neural networks, enabling more explicit and precise steering of AI behavior. The solution focuses on addressing AI safety, explainability, and controllability by offering a robust alternative to Prompt Engineering.

Key Objectives:

To provide a scalable framework to control AI behavior, especially for AGI (Artificial General Intelligence).

To create an intuitive interface that allows users to interact with AI models by directly manipulating their internal features.

To offer improved explainability, safety, reproducibility, transparency, and robustness in AI decision-making.

1.1. Problem Statement

As AI models scale and become increasingly intelligent, the need to control and steer AI behavior grows due to potential safety concerns. These concerns include preventing unintended behavior or ensuring that AI systems do not produce harmful outputs, especially in critical environments where AGI may present a new level of risk to society.

1.2. Proposed Solution

The solution involves utilizing Activation Engineering with Steering Vectors to control AI behavior. Unlike Prompt Engineering, which influences AI indirectly through external inputs, Activation Engineering manipulates the internal activations of large language models (LLMs) to produce a desired outcome. This provides more precise and reliable control over AI behavior.
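The core operation behind a Steering Vector can be illustrated with a minimal sketch: a fixed vector, scaled by a strength, is added to the hidden state at one layer. The function names and the toy numbers below are hypothetical, and plain Python lists stand in for the tensors a real model would use.

```python
from typing import List

Vector = List[float]


def add_steering_vector(hidden: Vector, steering: Vector, strength: float) -> Vector:
    """Return hidden + strength * steering for one token position."""
    if len(hidden) != len(steering):
        raise ValueError("hidden state and steering vector must have the same size")
    return [h + strength * s for h, s in zip(hidden, steering)]


def steer_sequence(
    hidden_states: List[Vector], steering: Vector, strength: float
) -> List[Vector]:
    """Apply the same steering vector at every token position of one layer."""
    return [add_steering_vector(h, steering, strength) for h in hidden_states]


# Toy hidden states for two tokens in a 3-dimensional model (made-up values).
hidden_states = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
steering = [1.0, 0.0, -1.0]

steered = steer_sequence(hidden_states, steering, strength=2.0)
print(steered)  # [[2.5, -1.0, 0.0], [3.5, 0.0, -2.5]]
```

A strength of 0.0 leaves the hidden states unchanged, which is what makes an explicit on/off switch in the UI possible.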

2. Implementation

2.1 Competitors

OpenAI and Anthropic provide neuron-level analysis tools, primarily for research and technical users. Neuronpedia, which collaborates with DeepMind, also targets experts. My solution differentiates itself by offering a chat interface for non-experts, allowing easy control of AI using Activation Engineering and making AI control more accessible and practical.

2.2 Development Stack

Frontend: Next.js and React for creating an intuitive UI that allows users to visualize and manipulate AI activations.

Backend: FastAPI-powered Python server for hosting AI models and managing activation manipulation.

Model: Open-weight LLMs such as Gemma 2, using steering vectors to modify internal states.

Technologies: Integrates Gemma Scope for analyzing LLMs, TorchServe for model inference, ONNX Runtime for efficient processing, and Faiss for large-scale nearest neighbor search if needed.
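As an illustration of how the frontend and backend in this stack might talk to each other, the sketch below defines one possible request payload and parses it from JSON. The field names, the `SteerRequest` and `FeatureSetting` classes, and the example values are all assumptions, not a fixed API. It uses only the standard library; a FastAPI backend would typically express the same shape as a request model.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FeatureSetting:
    layer: int       # which layer's SAE the feature belongs to
    index: int       # feature index inside that SAE
    strength: float  # 0.0 switches steering off for this feature


@dataclass
class SteerRequest:
    prompt: str
    features: List[FeatureSetting] = field(default_factory=list)

    def validate(self) -> None:
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        for setting in self.features:
            if setting.layer < 0 or setting.index < 0:
                raise ValueError("layer and index must be non-negative")


def parse_request(raw: str) -> SteerRequest:
    """Parse and validate a JSON body sent by the frontend."""
    data = json.loads(raw)
    request = SteerRequest(
        prompt=data["prompt"],
        features=[FeatureSetting(**item) for item in data.get("features", [])],
    )
    request.validate()
    return request


# Example body (hypothetical layer and feature numbers).
raw = '{"prompt": "Summarize this.", "features": [{"layer": 12, "index": 345, "strength": 4.0}]}'
request = parse_request(raw)
print(json.dumps(asdict(request), indent=2))
```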

2.3 Prototype Architecture

The architecture consists of two separate workflows: one on the user side and the other on the feature side.

Chat phase

User Input: The user submits a query through the frontend UI.

Activation Modification: Steering vectors are applied to modify activations within the LLM.

Response Generation: The model produces a response based on these modifications.

Result Display: The system compares the outcomes of Activation Engineering and traditional Prompt Engineering, showing the differences side-by-side for better understanding.

Analysis phase

Feature Extraction: Sparse AutoEncoders extract important internal features from the conversation.

Analysis Dashboard: Statistics showing which features the chat activated most.
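The analysis workflow could be sketched as below: a Sparse AutoEncoder encoder turns each token's activation into feature activations, and the dashboard ranks features by their total activation over the chat. This is a simplified sketch under stated assumptions: it uses a plain ReLU encoder (Gemma Scope itself uses a JumpReLU variant), and the weights, token activations, and function names are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]  # one row per SAE feature


def sae_encode(activation: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations f = ReLU(W_enc @ x + b_enc) for one token."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, activation)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def top_features(
    token_activations: List[Vector], w_enc: Matrix, b_enc: Vector, k: int
) -> List[Tuple[int, float]]:
    """Rank features by total activation summed over all tokens of the chat."""
    totals: Counter = Counter()
    for activation in token_activations:
        for index, value in enumerate(sae_encode(activation, w_enc, b_enc)):
            if value > 0.0:
                totals[index] += value
    return totals.most_common(k)


# Toy weights: 3 features over a 2-dimensional activation (hypothetical values).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, -0.5, 0.0]
tokens = [[2.0, 1.0], [1.0, 0.0], [0.0, 2.0]]

print(top_features(tokens, w_enc, b_enc, k=2))  # [(0, 3.0), (1, 2.0)]
```

Summing activations is only one possible dashboard statistic; counting how many tokens fire each feature would be an equally reasonable choice.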

2.4 Resource Requirements

GPU: NVIDIA A100 or V100 with 32GB VRAM for fast LLM inference and activation manipulation.

Web Server: Cloud-based GPU server (AWS/Google Cloud) with 8 vCPUs, 32GB RAM, and 500GB SSD for backend processing.

3. Application

3.1. Expected Impact and Benefits

Ethical Frameworks That Scale with AI Governance

The project strongly aligns with the D3CODE hackathon theme of scalability and ethics by focusing on:

Explainability: Mechanistic Interpretability is central to the project, providing a promising path for understanding how AI models make decisions.

Safety: Activation Engineering offers a mechanism for making AI models safer, especially when handling advanced systems like AGI, which may pose existential risks.

Transparency: By directly manipulating activations and making changes explicit, the system provides a transparent method for controlling AI.

Reproducibility: Unlike Prompt Engineering, which depends on natural language randomness, Activation Engineering provides more reproducible results through mathematical feature manipulation.

Robustness: Activation Engineering overrides input prompts, providing a more robust system for resisting jailbreaking and malicious inputs.

Both developers and society at large require a method to control AI effectively. AI systems increasingly exhibit behaviors that deviate from intended instructions, leading to misunderstandings or even harmful consequences. As LLMs grow more powerful, their ability to hallucinate or generate unintended outputs grows, making them potentially dangerous. The Activation Engineering interface provides a clear way for developers and non-experts to manage and interact with AI systems safely and transparently.

3.2. Final Result

At the end of the hackathon, the system will demonstrate:

A fully functional chat UI for Activation Engineering control.

A clear comparison between Prompt Engineering and Activation Engineering outputs.

Scalability in applying this method to various LLMs and scenarios, ensuring widespread applicability across industries.

The Activation Engineering UI provides users and developers with a method to:

Explicitly control AI behaviors by switching on and off specific activation patterns.

Improve AI safety by allowing for robust control, making it possible to prevent harmful actions by AI, particularly in AGI systems.

Ensure transparency through mechanisms that clearly illustrate how internal activations affect outputs.

Enhance reproducibility by bypassing the randomness inherent in Prompt Engineering and relying on more deterministic feature manipulation.
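Switching a specific feature on or off, as listed above, can be sketched as clamping that feature to a target value and shifting the activation along the feature's decoder direction by the difference. This is one common way to do it, not necessarily how the final prototype will; the `clamp_feature` helper and all numbers are hypothetical, and in practice the current value would come from the SAE encoder.

```python
from typing import List

Vector = List[float]


def clamp_feature(
    activation: Vector,
    current_value: float,
    target_value: float,
    decoder_direction: Vector,
) -> Vector:
    """Move the activation so the chosen feature reads target_value.

    Computes x' = x + (target - current) * d, where d is the feature's
    decoder direction in the SAE.
    """
    if len(activation) != len(decoder_direction):
        raise ValueError("activation and decoder direction must have the same size")
    delta = target_value - current_value
    return [x + delta * d for x, d in zip(activation, decoder_direction)]


# Toy example: the feature currently fires at 2.0 (made-up values).
activation = [1.0, 2.0]
decoder_direction = [0.5, -0.5]

switched_off = clamp_feature(activation, 2.0, 0.0, decoder_direction)
boosted = clamp_feature(activation, 2.0, 4.0, decoder_direction)
print(switched_off)  # [0.0, 3.0]
print(boosted)       # [2.0, 1.0]
```

Because the edit is a fixed arithmetic change to the activation, the same settings give the same modification every time, which is the basis of the reproducibility claim.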

3.3 Future Plans

Improvement: Extend the platform so that users can upload their own SAEs and upload datasets to fine-tune an SAE.

Recommended features: Automatically recommend and activate features based on the input to improve response quality.

Open-Source: Transition the project into an open-source platform for further research and development.

Collaboration: Work with AI governance bodies to deploy this solution in safety-critical domains like healthcare, finance, and autonomous systems.
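The feature recommendation idea in the plans above could start from a simple similarity search: score every candidate feature direction against a vector representing the input and suggest the closest ones. The sketch below uses brute-force cosine similarity; at scale, a nearest-neighbor library such as Faiss could take over this step. The feature names, vectors, and helper functions are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_features(
    query: Vector, feature_directions: Dict[str, Vector], k: int
) -> List[Tuple[str, float]]:
    """Return the k features whose directions are most similar to the query."""
    scored = [(name, cosine(query, d)) for name, d in feature_directions.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-dimensional feature directions and input representation.
feature_directions = {
    "summarization": [1.0, 0.0],
    "politeness": [0.0, 1.0],
    "math": [-1.0, 0.0],
}
query = [3.0, 1.0]

recommendations = recommend_features(query, feature_directions, k=2)
print([name for name, _ in recommendations])  # ['summarization', 'politeness']
```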

4. Conclusion

Activation Engineering offers a unique solution for controlling AI, addressing safety, scalability, and ethical concerns. As AI systems become more advanced, having a reliable method to steer their behavior and ensure their alignment with human values will be essential for the future of AI governance.