Steering Vector

Creator
Seonglae Cho
Created
2024 Apr 25 14:14
Edited
2026 Jan 1 15:58

Activation Steering

Intervening at later residual streams produces stronger steering, but modifying the very last residual stream reliably causes broken syntax (Turner, 2024).
SAE-based vectors enable more fine-grained control. Because SAE features are designed to be sparse, steering along one of them tends to have less impact on other behaviors, allowing for more precise adjustments.
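As a rough illustration of the intervention described above, the sketch below adds a scaled steering vector to the residual stream after one chosen layer of a toy residual stack. Everything here is hypothetical and for illustration only: the names `add_steering` and `run_with_steering`, the `alpha` scale, and the list-of-functions model are assumptions, not the setup used in the cited work. A real implementation would typically do the same addition on tensors inside a model hook, and an SAE-based vector would typically be the direction associated with one chosen feature.

```python
def add_steering(hidden, vector, alpha):
    """Return hidden + alpha * vector for one token's residual-stream activation."""
    if len(hidden) != len(vector):
        raise ValueError("hidden state and steering vector must have the same width")
    return [h + alpha * v for h, v in zip(hidden, vector)]


def run_with_steering(layers, x, vector, layer_index, alpha=1.0):
    """Run a toy residual stack, adding the steering vector after one layer.

    `layers` is a list of functions mapping a hidden state to an update.
    Each update is added back onto the stream (a residual connection).
    Passing layer_index=None runs the stack without steering.
    """
    hidden = list(x)
    for i, layer in enumerate(layers):
        update = layer(hidden)
        hidden = [h + u for h, u in zip(hidden, update)]
        if i == layer_index:
            # The intervention: shift the stream along the steering direction.
            hidden = add_steering(hidden, vector, alpha)
    return hidden
```

Here `layer_index` controls how late in the stack the shift is applied and `alpha` controls its strength. Both are the knobs one would sweep when checking where steering becomes strong and where the output starts to break.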
Steering Vector Methods
 
 
Feature Steering Notion
 
 
 

2016 smiling attribute vector

Computed by simply subtracting the mean vector for images without the smile attribute from the mean vector for images with the smile attribute
arxiv.org
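The mean-difference computation described above can be sketched in a few lines. This is a minimal version under stated assumptions: the inputs are plain lists of equal-length latent vectors, and the function names `mean_vector` and `attribute_vector` are hypothetical, not taken from the original work.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def attribute_vector(with_attr, without_attr):
    """Mean of the vectors with the attribute minus the mean of those without it."""
    pos = mean_vector(with_attr)
    neg = mean_vector(without_attr)
    return [p - n for p, n in zip(pos, neg)]
```

Adding a multiple of the resulting vector to a latent code moves it toward the attribute, and subtracting moves it away.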

Latent steering vector from 2022 ACL

Extracting Latent Steering Vectors from Pretrained Language Models
Nishant Subramani, Nivedita Suresh, Matthew Peters. Findings of the Association for Computational Linguistics: ACL 2022. 2022.

Anthropic SAE steering feature vector with limitation and application (2024)

"Evaluating feature steering: A case study in mitigating social biases"
A new piece of Anthropic research by Durmus et al.: "Evaluating feature steering: A case study in mitigating social biases"

Style vector

Style Vectors for Steering Generative Large Language Models
Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, Tobias Hecking. Findings of the Association for Computational Linguistics: EACL 2024. 2024.
For non-experts
Steering LLM Behavior Without Fine-Tuning
Modify the behavior or the personality of a model at inference time, without fine-tuning or prompt engineering.
Read the blog post 👉 https://huggingface.co/spaces/dlouapre/eiffel-tower-llama
Explore SAEs on the Hub 👉 https://huggingface.co/collections/dlouapre/sparse-auto-encoders-saes-for-mechanistic-interpretability
Neuronpedia https://www.neuronpedia.org
00:00 Introduction
00:25 Steering as Neurostimulation
02:18 Transformer architecture
04:25 Linear representation of concepts
09:04 Steering using 🤗 transformers
13:43 Finding steering vectors
14:36 Using Sparse AutoEncoders
16:28 Conclusion
huggingface.co
In-context learning (ICL) and Activation Steering can be viewed as fundamentally the same mechanism: both modify the model's Belief State about latent concepts. ICL accumulates evidence (likelihood) through contextual examples, while steering directly shifts the prior probability. The two combine additively in log-belief space, so small adjustments can cause sharp behavioral phase shifts. This perspective helps predict and explain phenomena such as the sigmoid learning curve of many-shot ICL and jailbreak thresholds.
arxiv.org
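To make the "additive in log-belief space" idea concrete, here is a toy two-hypothesis Bayesian sketch. It is not the model fitted in the linked paper. It assumes a single latent concept, a constant log-likelihood ratio per in-context example, and a steering intervention that acts as a constant shift of the log prior odds. The function name `belief` and all parameter names and values are hypothetical.

```python
import math


def belief(log_prior_odds, log_likelihood_ratio, n_examples, steering_shift=0.0):
    """Posterior belief in a concept under a toy additive log-odds model.

    log odds = prior log odds + steering shift + n_examples * per-example evidence.
    The belief is the sigmoid of that sum, so it changes sharply near log odds = 0.
    """
    log_odds = log_prior_odds + steering_shift + n_examples * log_likelihood_ratio
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With a prior log odds of -4.0 and 0.5 units of evidence per example, the belief reaches 0.5 at 8 examples. Adding a steering shift of 2.0 moves that crossover to 4 examples, which shows how a fixed shift and accumulated context trade off against each other in this simplified picture.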
 
