Feature Guided Activation Additions
A technique that builds interpretable, precise steering vectors in SAE (sparse autoencoder) feature space: it filters features by activation density, removes BOS features, keeps only the top features, and then optimizes the steering vector with a linear effect approximator. It is reported to give stronger behavioral control and more consistent outputs than CAA (Contrastive Activation Addition), SAE decoder steering, and SAE-TS on most tasks.
- Contrastive difference computed in SAE feature space
- Density filtering that keeps only low-density SAE features
- Removal of BOS features
- Top-k feature selection
- SAE-TS-like effect approximator
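
The steps above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names, the ReLU encoder form, the `max_density` threshold, selection by absolute magnitude, and the least-squares solve standing in for the effect-approximator optimization are all hypothetical choices made for the example.

```python
import numpy as np


def sae_encode(acts, W_enc, b_enc):
    """Assumed SAE encoder form: ReLU(acts @ W_enc + b_enc)."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)


def fgaa_target(pos_acts, neg_acts, W_enc, b_enc, density, bos_idx, k,
                max_density=0.01):
    """Build a filtered target vector in SAE feature space.

    density     : per-feature activation frequency (assumed precomputed)
    bos_idx     : indices of features that fire mainly on the BOS token
    max_density : hypothetical threshold; features above it are dropped
    """
    # 1. Contrastive difference of mean feature activations.
    diff = (sae_encode(pos_acts, W_enc, b_enc).mean(axis=0)
            - sae_encode(neg_acts, W_enc, b_enc).mean(axis=0))
    # 2. Density filtering: zero out high-density features.
    diff[density > max_density] = 0.0
    # 3. Remove BOS features.
    diff[bos_idx] = 0.0
    # 4. Top-k selection (by absolute magnitude, an assumption here).
    keep = np.argsort(np.abs(diff))[-k:]
    target = np.zeros_like(diff)
    target[keep] = diff[keep]
    return target


def steering_vector(target, effect_map):
    """Pick the steering vector whose predicted feature effects best match
    the target, given a linear effect approximator.

    effect_map : (n_features, d_model) matrix with effects ~= effect_map @ v.
                 A plain least-squares solve is used as a stand-in for the
                 optimization step.
    """
    v, *_ = np.linalg.lstsq(effect_map, target, rcond=None)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

In use, `pos_acts` and `neg_acts` would be hidden states collected at one layer for prompts with and without the desired behavior, and the returned unit vector would be scaled and added to the model's hidden states at that layer.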
Interpretable Steering of Large Language Models with Feature Guided...
Effective and reliable control over Large Language Model behavior is a significant challenge. While activation steering methods, which add steering vectors to a model's hidden states, are a...
https://openreview.net/forum?id=swRxS7s4rB

Seonglae Cho