Activation Steering
Intervening at later residual streams produces stronger steering, but modifying the very last residual stream reliably causes broken syntax (Turner, 2024).
SAE-based vectors enable more fine-grained control. Because SAE features are trained to be sparse, steering with them tends to limit the impact on other behaviors, allowing more precise adjustments.
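As a rough illustration of what steering with a single SAE feature looks like, here is a minimal NumPy sketch. Everything in it is an assumption for illustration: the residual stream and the decoder matrix `w_dec` are random toy arrays, and `steer_with_sae_feature`, `feature_idx`, and `alpha` are hypothetical names, not part of any real SAE library. It only shows that the intervention moves activations along one decoder direction; it does not demonstrate the sparsity benefit itself.

```python
import numpy as np

def steer_with_sae_feature(resid, w_dec, feature_idx, alpha):
    """Add alpha times the unit-norm decoder direction of one SAE feature
    to every position of the residual stream (hypothetical helper)."""
    direction = w_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return resid + alpha * direction, direction

rng = np.random.default_rng(0)
resid = rng.normal(size=(5, 16))   # toy residual stream: (positions, d_model)
w_dec = rng.normal(size=(64, 16))  # toy SAE decoder: (n_features, d_model)

steered, direction = steer_with_sae_feature(resid, w_dec, feature_idx=3, alpha=4.0)

# Change in the projection onto the feature direction, per position.
shift = (steered - resid) @ direction
```

In a real model the same addition would be applied inside the forward pass at a chosen layer, with `alpha` tuned by hand.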
Steering Vector Methods
Feature Steering Notion
2016 smiling attribute vector
Computed by simply subtracting the mean vector for images without the smile attribute from the mean vector for images with the smile attribute
Latent steering vector from 2022 ACL
Anthropic SAE steering feature vector with limitation and application (2024)
Style vector
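The mean-difference construction described for the smiling attribute vector above can be sketched in a few lines. This is a toy example under stated assumptions: the "activations" are synthetic Gaussian samples, `attribute` is a made-up ground-truth direction, and `mean_difference_vector` is a hypothetical helper name rather than code from any of the listed works.

```python
import numpy as np

def mean_difference_vector(pos_acts, neg_acts):
    """Mean of activations with the attribute minus mean of those without it."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

rng = np.random.default_rng(0)
d = 8
attribute = np.zeros(d)
attribute[0] = 1.0  # assumed ground-truth attribute direction (toy)

neg = rng.normal(size=(200, d))                    # samples without the attribute
pos = rng.normal(size=(200, d)) + 3.0 * attribute  # samples with the attribute

v = mean_difference_vector(pos, neg)

# Steering: add the vector to a sample that lacks the attribute.
steered = neg[0] + v
```

With enough samples the noise averages out, so `v` points mostly along the assumed attribute direction.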
For non-experts
In-context learning (ICL) and activation steering are fundamentally the same mechanism: both modify the model's belief state about latent concepts. ICL accumulates evidence (likelihood) through contextual examples, while steering directly shifts the prior. The two combine additively in log-belief space, so small adjustments can cause sharp behavioral phase shifts. This perspective helps predict and explain phenomena such as the sigmoid learning curve of many-shot ICL and jailbreak thresholds.
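The additive picture above can be made concrete with a two-hypothesis toy model. This is only a sketch under assumptions: the function `belief` and the values of `prior_logit`, `evidence_per_shot`, and `steer` are illustrative and not taken from the source. Each in-context example adds a fixed log-likelihood ratio, steering adds a constant offset to the prior log-odds, and the belief is the sigmoid of the sum.

```python
import math

def belief(n_shots, steer=0.0, prior_logit=-6.0, evidence_per_shot=0.5):
    """Posterior belief in a latent concept under a toy log-odds model.

    logit = prior log-odds + steering offset + n_shots * per-shot log-likelihood ratio
    """
    logit = prior_logit + steer + n_shots * evidence_per_shot
    return 1.0 / (1.0 + math.exp(-logit))

shots = list(range(0, 25, 4))
curve = [belief(n) for n in shots]                     # ICL only
steered_curve = [belief(n, steer=3.0) for n in shots]  # ICL plus steering
```

In this toy model the unsteered belief crosses 0.5 at 12 shots, while a steering offset of 3.0 moves the crossing to 6 shots. The curve shape is unchanged and only shifts along the shot axis, which is one way to read the claim that steering and ICL trade off against each other around a threshold.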

Seonglae Cho
