Activation Steering
Intervening at later residual streams produces stronger steering, but modifying the very last residual stream reliably causes broken syntax (Turner, 2024)
SAE-based vectors enable more fine-grained control. Because SAE features are sparse, steering with them minimizes the impact on other behaviors, allowing more precise adjustments.
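As a concrete illustration of the notes above, here is a minimal sketch of residual-stream steering with 🤗 transformers: a PyTorch forward hook adds a steering vector to one block's output at inference time. The model name, layer index, scale, random placeholder vector, and the GPT-2-specific `model.transformer.h` path are illustrative assumptions, not any particular paper's setup.

```python
# Minimal sketch: add a steering vector to the residual stream of one
# transformer layer at inference time via a PyTorch forward hook.
# Model name, layer index, scale, and the vector itself are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; other decoder-only HF models work similarly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # mid-stack; later layers steer harder but risk broken syntax
scale = 4.0     # steering strength
d_model = model.config.hidden_size
steering_vector = torch.randn(d_model)  # stand-in for a learned/derived direction
steering_vector = steering_vector / steering_vector.norm()

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
try:
    ids = tok("The Eiffel Tower is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls are unaffected
```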
Steering Vector Methods
Feature Steering Notion
2016 smiling attribute vector
Computed by simply subtracting the mean vector for images without the smile attribute from the mean vector for images with the smile attribute
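A minimal sketch of that mean-difference recipe, with synthetic arrays standing in for the attribute-labelled representations; the dimensionality and scaling factor are arbitrary.

```python
# Sketch of the mean-difference recipe behind attribute/steering vectors:
# average representations with the attribute, average those without, subtract.
# The arrays here are synthetic placeholders.
import numpy as np

d = 768                                  # representation dimensionality (illustrative)
with_attr = np.random.randn(500, d)      # e.g. latents for "smiling" examples
without_attr = np.random.randn(500, d)   # latents for non-smiling examples

attribute_vector = with_attr.mean(axis=0) - without_attr.mean(axis=0)

# Steering = move a representation along the attribute direction.
z = np.random.randn(d)                   # some example's representation
z_steered = z + 1.5 * attribute_vector   # 1.5 is an arbitrary strength
```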
Latent steering vectors (Findings of ACL 2022)
Extracting Latent Steering Vectors from Pretrained Language Models
Nishant Subramani, Nivedita Suresh, Matthew Peters. Findings of the Association for Computational Linguistics: ACL 2022. 2022.
https://aclanthology.org/2022.findings-acl.48/
Anthropic SAE feature steering vector, with its limitations and applications (2024)
"Evaluating feature steering: A case study in mitigating social biases"
A new piece of Anthropic research by Durmus et al.: "Evaluating feature steering: A case study in mitigating social biases"
https://www.anthropic.com/research/evaluating-feature-steering
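A hedged sketch of what SAE feature steering looks like mechanically: take the decoder direction of a chosen feature and add a scaled copy to the residual stream. The SAE weights, feature index, and strength below are placeholders, not Anthropic's actual model or code.

```python
# Sketch of SAE feature steering: an SAE decoder row gives a feature's
# direction in the residual stream; scale it and add it to activations.
# The SAE weights here are random placeholders, not a trained model.
import torch

d_model, n_features = 768, 16384
W_dec = torch.randn(n_features, d_model)           # decoder: feature -> residual direction
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)   # unit-norm feature directions

feature_id = 1234   # index of the feature to amplify (hypothetical)
strength = 8.0      # positive amplifies the feature, negative suppresses it

def steer_with_feature(resid: torch.Tensor) -> torch.Tensor:
    """Add `strength` times the chosen feature direction to residual activations."""
    return resid + strength * W_dec[feature_id]

resid = torch.randn(1, 12, d_model)   # (batch, seq, d_model) residual stream
resid = steer_with_feature(resid)
```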

Style vector
Style Vectors for Steering Generative Large Language Models
Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, Tobias Hecking. Findings of the Association for Computational Linguistics: EACL 2024. 2024.
https://aclanthology.org/2024.findings-eacl.52/
For non-experts
Steering LLM Behavior Without Fine-Tuning
Modify the behavior or the personality of a model at inference time, without fine-tuning or prompt engineering.
Read the blog post 👉 https://huggingface.co/spaces/dlouapre/eiffel-tower-llama
Explore SAEs on the Hub 👉 https://huggingface.co/collections/dlouapre/sparse-auto-encoders-saes-for-mechanistic-interpretability
Neuronpedia https://www.neuronpedia.org
00:00 Introduction
00:25 Steering as Neurostimulation
02:18 Transformer architecture
04:25 Linear representation of concepts
09:04 Steering using 🤗 transformers
13:43 Finding steering vectors
14:36 Using Sparse AutoEncoders
16:28 Conclusion
https://www.youtube.com/watch?v=F2jd5WuT-zg

In-context learning and activation steering are fundamentally the same mechanism: both modify the model's belief state over latent concepts. ICL accumulates evidence (likelihood) through contextual examples, while steering directly shifts the prior. The two combine additively in log-belief space, so small adjustments can cause sharp behavioral phase shifts. This perspective predicts and explains phenomena such as many-shot ICL's sigmoid learning curve and jailbreak thresholds.
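A toy numerical sketch of this additive-in-log-odds picture, assuming a binary latent concept, a fixed amount of evidence per in-context example, and made-up numbers; it only shows how a steering shift and shot count trade off on the same sigmoid.

```python
# Toy illustration of the "additive in log-belief space" claim:
# steering shifts the prior log-odds of a binary latent concept, each
# in-context example adds a fixed amount of evidence, and the posterior
# follows a sigmoid in the number of shots. All values are made up.
import math

def posterior(n_shots: int, evidence_per_shot: float, steering_shift: float,
              prior_log_odds: float = -6.0) -> float:
    log_odds = prior_log_odds + steering_shift + n_shots * evidence_per_shot
    return 1.0 / (1.0 + math.exp(-log_odds))   # sigmoid in log-odds

for n in (0, 4, 8, 16, 32):
    print(n,
          round(posterior(n, evidence_per_shot=0.4, steering_shift=0.0), 3),
          round(posterior(n, evidence_per_shot=0.4, steering_shift=3.0), 3))
# With the steering shift the curve crosses the same belief threshold with far
# fewer shots, i.e. a small shift can push the model over a behavioral phase boundary.
```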

Seonglae Cho