Steering AI based on Activation Engineering to improve reasoning ability
Artificial Intelligence (AI) is getting smarter, and with it grows the need to control AI for safety reasons. Prompt Engineering is currently the typical way to alter the output of an AI model. However, several studies are exploring how to control AI without relying on prompting, by decomposing the features of Large Language Models (LLMs) such as GPT-4 (Gao et al., 2024) and Claude 3 Sonnet (Templeton et al., 2024). These findings suggest that AI can be steered by extracting features from LLMs and manipulating them. Specifically, adding a Steering Vector to the model’s activation layer alters the responses of LLMs (Konen et al., 2024). Building on these attempts to control AI by manually altering vectors, similar to approaches for hacking human neuron activation (Jamali, 2024), "Activation Engineering" (Turner, 2023) is an emerging field.
My project aims to change performance on a diverse range of LLM evaluation benchmarks, such as reasoning and question answering, using Activation Engineering. The research has two steps: a) decomposing features from a Large Language Model, and b) measuring benchmark performance before and after applying Steering Vectors and, for Prompt Engineering, common prompts. These experiments are conducted on Gemma Scope (Lieberum et al., 2024), and features are automatically identified based on GPT-4 (Bills et al., 2023). I then manually selected candidate features and mapped them to benchmarks and to their highest performance metrics. The LLMs with manually manipulated vectors showed higher performance metrics overall and a significant improvement in machine translation. Combining multiple features also improved performance in some cases, but activating multiple features sometimes resulted in absurd outputs.
The results demonstrate that understanding AI internals and using Interpretable AI is practical. Compared to Prompt Engineering, which controls AI by appealing to it, Activation Engineering shows more potential to steer AI precisely by switching features on and off. In this presentation, Activation Engineering is shown to be a possible substitute for Prompt Engineering: it can alter the output of LLMs and positively affect their performance.
Add one or two sentences on the procedure, explaining how it was inspired by the underlying prior research
Bricken (2023)
Reference list
Steering AI based on Activation Engineering to Improve Reasoning Ability
Artificial Intelligence (AI) is getting smarter, and this is raising the need to control AI due to safety concerns, such as preventing unintended or harmful behavior. One common method to control AI outputs today is Prompt Engineering, where specific prompts are used to steer the responses of the AI model. However, recent studies are exploring how to control AI without relying on prompts by decomposing features of Large Language Models (LLMs) like GPT-4 (Gao et al., 2024) and Claude 3 Sonnet (Templeton et al., 2024). This research suggests that it is possible to steer AI by manipulating feature vectors extracted from LLMs. For example, adding a Steering Vector to the neural network’s activation layer has been shown to alter the responses of LLMs (Konen et al., 2024).
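As a rough illustration of the steering step described above, the sketch below adds a scaled feature direction to a hidden activation vector. It is a toy example in plain Python and not the procedure of any cited work: the function name `add_steering_vector`, the 4-dimensional vectors, and the `scale` value are all hypothetical and chosen only for illustration.

```python
from typing import List


def add_steering_vector(
    activation: List[float], direction: List[float], scale: float
) -> List[float]:
    """Return activation + scale * direction, computed element-wise."""
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    return [a + scale * d for a, d in zip(activation, direction)]


# Toy hidden state and a hypothetical feature direction (illustrative values).
hidden = [0.5, -1.0, 0.0, 2.0]
feature_direction = [1.0, 0.0, 0.0, -1.0]

# A positive scale pushes the activation along the feature direction;
# a scale of 0.0 leaves the activation unchanged.
steered = add_steering_vector(hidden, feature_direction, scale=2.0)
print(steered)  # [2.5, -1.0, 0.0, 0.0]
```

In an actual LLM the same addition would typically be applied to the activations of a chosen layer during the forward pass, with the direction taken from an extracted feature rather than written by hand.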
Although previous research has largely focused on feature extraction, there has been less attention on applying Steering Vectors practically. My research aims to address this gap by using Activation Engineering to enhance LLM performance across different benchmarks, particularly in reasoning and question-answering tasks. This project will involve two steps: a) decomposing features from a Large Language Model, and b) evaluating performance on benchmarks before and after applying Steering Vectors, then comparing these results with those achieved through Prompt Engineering.
The experiments will be conducted on Gemma Scope (Lieberum et al., 2024), utilizing a method that automatically identifies features based on GPT-4 (Bills et al., 2023). After feature extraction, I will manually select and map these features to different benchmarks to optimize performance metrics. This process is expected to demonstrate that Activation Engineering can achieve superior results across multiple tasks. Additionally, I will investigate the combination of multiple feature vectors to ensure the model generates coherent and meaningful outputs.
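One way to picture combining multiple feature vectors is a weighted sum of their directions. The sketch below also caps the length of the combined vector, on the assumption that limiting the total perturbation may help keep outputs coherent; that cap is an illustrative choice of this sketch, not a method taken from the cited work. The names `combine_features` and `max_norm` and all numbers are hypothetical.

```python
import math
from typing import List, Sequence


def combine_features(
    directions: Sequence[Sequence[float]],
    weights: Sequence[float],
    max_norm: float,
) -> List[float]:
    """Weighted sum of feature directions, rescaled if its L2 norm exceeds max_norm."""
    if not directions or len(directions) != len(weights):
        raise ValueError("need one weight per direction, and at least one direction")
    dim = len(directions[0])
    combined = [0.0] * dim
    for direction, weight in zip(directions, weights):
        if len(direction) != dim:
            raise ValueError("all directions must have the same length")
        for i, value in enumerate(direction):
            combined[i] += weight * value
    # Assumption of this sketch: cap the overall steering strength.
    norm = math.sqrt(sum(v * v for v in combined))
    if norm > max_norm:
        factor = max_norm / norm
        combined = [v * factor for v in combined]
    return combined


# Two hypothetical feature directions with equal weights.
steering = combine_features([[3.0, 0.0], [0.0, 4.0]], [1.0, 1.0], max_norm=2.5)
print(steering)  # [1.5, 2.0]
```

The combined vector can then be added to the model's activations in the same way as a single Steering Vector.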
Compared to Prompt Engineering, which influences AI behavior indirectly, Activation Engineering provides a more precise way to steer AI by directly manipulating internal feature activations. This research will showcase how controlling AI through understanding its internal mechanisms can lead to improved performance and more interpretable AI systems.
after comment
Seonglae Cho