Scalable AI Steering: Activation Engineering for Safer and Transparent AGI
0. Introduction
As Artificial Intelligence (AI) becomes more capable, the need to control it grows, driven by safety concerns such as preventing unintended or harmful behavior. Prompt Engineering is the most common way to steer AI: it alters the model's responses by changing the input. For example, adding a prompt like "Please summarize the following text" guides the response through the input alone. In contrast, Activation Engineering directly adjusts the internal state of a "summarization" feature within the AI's neural network to achieve the desired outcome without modifying the prompt. Recent studies aim to control AI without relying on prompting by extracting features from Large Language Models (LLMs) such as GPT-4 (Gao et al., 2024) and Claude Sonnet (Templeton et al., 2024). For instance, adding a Steering Vector to a model's internal activations alters its responses (Konen et al., 2024). These attempts to control AI by directly modifying internal states are referred to as "Activation Engineering" (Turner et al., 2024).
Activation Engineering is grounded in Mechanistic Interpretability (Elhage et al., 2021) and can in principle be applied to any Transformer model, including models scaled up according to Scaling Laws (Kaplan et al., 2020). This makes it a potentially highly scalable approach to controlling AGI, offering precise control over increasingly powerful AI systems.
1. Objective
The goal of this project is to develop an innovative control mechanism for AI models by employing Activation Engineering. This technique manipulates internal activations of neural networks, enabling more explicit and precise steering of AI behavior. The solution focuses on addressing AI safety, explainability, and controllability by offering a robust alternative to Prompt Engineering.
Key Objectives:
- To provide a scalable framework to control AI behavior, especially for AGI (Artificial General Intelligence).
- To create an intuitive interface that allows users to interact with AI models by directly manipulating their internal features.
- To offer improved explainability, safety, reproducibility, transparency, and robustness in AI decision-making.
1.1. Problem Statement
As AI models scale and become increasingly intelligent, the need to control and steer AI behavior grows due to potential safety concerns. These concerns include preventing unintended behavior or ensuring that AI systems do not produce harmful outputs, especially in critical environments where AGI may present a new level of risk to society.
1.2. Proposed Solution
The solution involves utilizing Activation Engineering with Steering Vectors to control AI behavior. Unlike Prompt Engineering, which influences AI indirectly through external inputs, Activation Engineering manipulates the internal activations of large language models (LLMs) to produce a desired outcome. This provides more precise and reliable control over AI behavior.
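The idea of a Steering Vector can be illustrated with a minimal sketch. It assumes one common recipe: take the mean difference between activations recorded on two contrastive sets of prompts, then add a scaled copy of that vector to the hidden state. The function names, the scale `alpha`, and the random data are hypothetical stand-ins, not the project's actual implementation.

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean activation difference between two contrastive sets, shape (d_model,)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden: np.ndarray, vector: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Add the scaled steering vector to every token position of `hidden`."""
    return hidden + alpha * vector

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical activations recorded at one layer, 16 examples per set.
# The two sets are shifted apart along dimension 0 to mimic a real feature.
direction = np.zeros(d_model)
direction[0] = 1.0
pos_acts = rng.normal(size=(16, d_model)) + direction  # e.g. "summarize" prompts
neg_acts = rng.normal(size=(16, d_model)) - direction  # e.g. neutral prompts

v = steering_vector(pos_acts, neg_acts)
hidden = rng.normal(size=(5, d_model))  # hidden states for 5 token positions
steered = steer(hidden, v, alpha=2.0)

print("steering vector:", np.round(v, 2))
print("shift applied to each token:", np.round((steered - hidden)[0], 2))
```

In a real model the vector would be added at a chosen layer during the forward pass, and both the layer and `alpha` would need tuning.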
2. Implementation
2.1 Democratizing AI Compared to Competitors
Simplifying complex tools: AI control for non-experts
OpenAI and Anthropic provide neuron-level analysis tools, primarily for research and technical users. Neuronpedia also targets experts. My solution differentiates itself by offering a chat interface for non-experts, allowing easy control of AI through Activation Engineering and making AI control more accessible and practical.
2.2 Prototype Architecture
The architecture consists of two separate phases: Chat Phase and Analysis Phase.
Chat Phase
- User Input: The user submits a query via the frontend UI.
- Activation Modification: Steering vectors modify activations within the LLM.
- Result Display: Compares outcomes of Activation Engineering and traditional Prompt Engineering side-by-side.
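The Chat Phase steps above can be sketched with a toy stand-in for an LLM. Everything here is an assumption made for illustration: `ToyModel`, the `hook` callback, and `compare` are hypothetical names, the "prompted" input is just a perturbed vector, and the model returns a single number instead of text. The point is only to show where the activation modification happens and how the two results end up side by side.

```python
import numpy as np
from typing import Callable, Optional

Hook = Callable[[np.ndarray], np.ndarray]

class ToyModel:
    """Stand-in for an LLM: a stack of residual layers ending in a readout."""

    def __init__(self, n_layers: int = 4, d_model: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.weights = [rng.normal(scale=0.1, size=(d_model, d_model))
                        for _ in range(n_layers)]
        self.readout = rng.normal(size=d_model)

    def forward(self, x: np.ndarray, hook_layer: Optional[int] = None,
                hook: Optional[Hook] = None) -> float:
        h = x
        for i, w in enumerate(self.weights):
            h = h + np.tanh(h @ w)  # residual update
            if hook is not None and i == hook_layer:
                h = hook(h)  # activation modification happens here
        return float(h @ self.readout)

def compare(model: ToyModel, x_plain: np.ndarray, x_prompted: np.ndarray,
            vector: np.ndarray, alpha: float, layer: int) -> dict:
    """Run both control methods and return their outputs side by side."""
    return {
        # Prompt Engineering: change the input, leave the internals alone.
        "prompt_engineering": model.forward(x_prompted),
        # Activation Engineering: keep the input, shift one layer's output.
        "activation_engineering": model.forward(
            x_plain, hook_layer=layer, hook=lambda h: h + alpha * vector),
    }

model = ToyModel()
rng = np.random.default_rng(1)
x_plain = rng.normal(size=8)
x_prompted = x_plain + 0.5 * rng.normal(size=8)  # stand-in for an added instruction
vector = model.readout / np.linalg.norm(model.readout)
result = compare(model, x_plain, x_prompted, vector, alpha=3.0, layer=3)
print(result)
```

With a real Transformer, the same pattern is typically implemented by registering a hook on a chosen layer rather than by editing the forward pass.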
Analysis Phase
- Feature Extraction: Sparse AutoEncoders extract key internal features from the conversation.
- Activation Analysis: Analyzes modified activations to understand their effect on the model's behavior.
- Analysis Dashboard: Displays activation statistics such as the most activated features.
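The feature extraction and dashboard statistics above can be illustrated as follows. This is a minimal sketch under stated assumptions: the `SparseAutoencoder` class and `top_features` helper are hypothetical, the weights are random instead of trained, and a plain ReLU encoder is used. Trained SAEs such as those shipped with Gemma Scope may use a different activation function and would be loaded from disk.

```python
import numpy as np

class SparseAutoencoder:
    """Toy SAE with random weights (a real one would be trained)."""

    def __init__(self, d_model: int, n_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.5, size=(d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(scale=0.5, size=(n_features, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, acts: np.ndarray) -> np.ndarray:
        """Map activations (n_tokens, d_model) to non-negative feature activations."""
        return np.maximum(0.0, (acts - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, feats: np.ndarray) -> np.ndarray:
        """Reconstruct activations from feature activations."""
        return feats @ self.W_dec + self.b_dec

def top_features(feature_acts: np.ndarray, k: int = 3) -> list:
    """Return (feature_index, mean_activation) pairs, most active first."""
    mean_act = feature_acts.mean(axis=0)
    order = np.argsort(mean_act)[::-1][:k]
    return [(int(i), float(mean_act[i])) for i in order]

sae = SparseAutoencoder(d_model=8, n_features=32)
rng = np.random.default_rng(1)
token_acts = rng.normal(size=(10, 8))  # hypothetical activations for 10 tokens
feats = sae.encode(token_acts)

for idx, score in top_features(feats, k=3):
    print(f"feature {idx}: mean activation {score:.3f}")
```

A dashboard could display the returned pairs directly, alongside whatever human-readable label each feature index has been given.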
2.3 Development Stack
- Frontend: Next.js and React for creating an intuitive UI that allows users to visualize and manipulate AI activations.
- Backend: FastAPI-powered Python server for hosting AI models and managing activation manipulation.
- Model: GPT variants and other LLMs, using steering vectors to modify internal states.
- Technologies: Integrates Gemma Scope for analyzing LLMs, TorchServe for model inference, ONNX Runtime for efficient processing, and Faiss for large-scale nearest neighbor search if needed.
2.4 Resource Requirements
- GPU: NVIDIA A100 or V100 with at least 32GB VRAM for fast LLM inference and activation manipulation.
- Web Server: Cloud-based GPU server (AWS/Google Cloud) with 8 vCPUs, 32GB RAM, and 500GB SSD for backend processing.
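A rough back-of-the-envelope check shows how a VRAM budget like the one above relates to model size. The sketch below counts weight memory only, and the parameter counts are hypothetical examples, not the models the prototype uses. Activations, the KV cache, and any loaded SAEs add further overhead.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

# Hypothetical model sizes, in parameters.
for name, n_params in [("2B", 2e9), ("9B", 9e9), ("27B", 27e9)]:
    fp16 = weight_memory_gb(n_params, bytes_per_param=2)
    fp32 = weight_memory_gb(n_params, bytes_per_param=4)
    fits = "fits" if fp16 <= 32 else "does not fit"
    print(f"{name}: fp16 ~{fp16:.0f} GB, fp32 ~{fp32:.0f} GB "
          f"({fits} in 32 GB at fp16, weights only)")
```

Under these assumptions, models up to roughly 9B parameters leave headroom on a 32GB card at 16-bit precision, while larger ones would need quantization or multiple GPUs.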
3. Application
3.1. Expected Impact and Benefits
Ethical Frameworks That Scale with AI Governance
The project strongly aligns with the D3CODE hackathon theme of scalability and ethics by focusing on:
- Explainability: Mechanistic Interpretability is central to the project, providing a promising path for understanding how AI models make decisions.
- Safety: Activation Engineering offers a mechanism for making AI models safer, especially when handling advanced systems like AGI, which may pose existential risks.
- Transparency: By directly manipulating activations and making changes explicit, the system provides a transparent method for controlling AI.
- Reproducibility: Unlike Prompt Engineering, whose results vary with the exact wording of the prompt, Activation Engineering provides more reproducible results through mathematical feature manipulation.
- Robustness: Activation Engineering acts on the model's internal state regardless of the input prompt, providing a more robust defense against jailbreaking and malicious inputs.
Both developers and society at large require a method to control AI effectively. AI systems increasingly exhibit behaviors that deviate from intended instructions, leading to misunderstandings or even harmful consequences. As LLMs grow more powerful, their hallucinations and unintended outputs become potentially more dangerous. The Activation Engineering interface provides a clear way for developers and non-experts to manage and interact with AI systems safely and transparently.
3.2. Final Result
Solution Demonstration
- A fully functional chat UI for Activation Engineering control.
- A clear comparison between Prompt Engineering and Activation Engineering outputs.
- Scalability in applying this method to various LLMs and scenarios, ensuring widespread applicability across industries.
Social Impact
- Explicitly control AI behaviors by switching specific activation patterns on and off.
- Improve AI safety by allowing for robust control, making it possible to prevent harmful actions by AI, particularly in AGI systems.
- Ensure transparency through mechanisms that clearly illustrate how internal activations affect outputs.
- Enhance reproducibility by bypassing the randomness inherent in Prompt Engineering and relying on more deterministic feature manipulation.
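Switching a specific activation pattern on or off, as described in the list above, could look like the following sketch. It assumes the pattern is a single feature of a sparse autoencoder with a plain ReLU encoder. The function `clamp_feature`, the random weights, and the feature index are hypothetical. The SAE's reconstruction error is added back so that only the chosen feature changes.

```python
import numpy as np

def clamp_feature(acts, W_enc, b_enc, W_dec, b_dec, feature: int, value: float):
    """Set one SAE feature to `value` and map the result back to activation space."""
    feats = np.maximum(0.0, (acts - b_dec) @ W_enc + b_enc)  # encode
    recon = feats @ W_dec + b_dec                            # decode
    error = acts - recon  # part of the activation the SAE does not explain
    edited = feats.copy()
    edited[:, feature] = value  # 0.0 switches the feature off
    return edited @ W_dec + b_dec + error

rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(scale=0.5, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.5, size=(n_features, d_model))
b_dec = np.zeros(d_model)
acts = rng.normal(size=(5, d_model))  # hypothetical activations for 5 tokens

off = clamp_feature(acts, W_enc, b_enc, W_dec, b_dec, feature=7, value=0.0)
on = clamp_feature(acts, W_enc, b_enc, W_dec, b_dec, feature=7, value=5.0)
print("change per token when switching on:", np.round((on - off)[0], 2))
```

Because the same feature index and value always produce the same shift, the edit is repeatable in a way that rewording a prompt is not.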
3.3 Future Plans
- Recommendation Feature: Automatically recommend and activate features from input to improve response performance.
- Platformization: Extend the system to allow users to upload SAEs and datasets for fine-tuning and model customization.
- Open-Source: Transition the project into an open-source platform for further research and development.
- Collaboration: Work with AI governance bodies to deploy this solution in safety-critical domains like healthcare, finance, and autonomous systems.
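One way the Recommendation Feature listed above might work is a nearest-neighbor lookup between a representation of the input and the available feature directions. The sketch below is an assumption, not a committed design. The stack section names Faiss for large-scale search, but this example uses brute-force cosine similarity in NumPy. `recommend_features` and the toy vectors are hypothetical.

```python
import numpy as np

def recommend_features(query: np.ndarray, feature_directions: np.ndarray,
                       k: int = 3) -> list:
    """Return (feature_index, cosine_similarity) for the k closest directions."""
    q = query / np.linalg.norm(query)
    d = feature_directions / np.linalg.norm(feature_directions, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Four hypothetical feature directions (one per row) and one input representation.
feature_directions = np.eye(4)
query = np.array([0.1, 0.2, 0.9, 0.0])

for idx, score in recommend_features(query, feature_directions, k=2):
    print(f"recommend feature {idx} (cosine similarity {score:.2f})")
```

For a large feature dictionary, the brute-force matrix product could be replaced by an approximate index with the same inputs and outputs.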
4. Conclusion
Activation Engineering offers a unique solution for controlling AI, addressing safety, scalability, and ethical concerns. As AI systems become more advanced, having a reliable method to steer their behavior and ensure their alignment with human values will be essential for the future of AI governance.
5. References
References have been included on the final slide due to character limitations.
Seonglae Cho