Feature Guided Activation Additions
Feature Guided Activation Additions (FGAA) builds interpretable, precise steering vectors by combining density filtering, BOS (beginning-of-sequence) feature removal, top-k feature selection, and optimization through a linear effect approximator. On most tasks it gives stronger behavioral control and more consistent outputs than CAA (Contrastive Activation Addition), SAE (sparse autoencoder) decoder steering, and SAE-TS (SAE-Targeted Steering).
- Contrastive difference computed in SAE feature space
- Low-density filtering of SAE features
- Removal of BOS features
- Top-k feature selection
- SAE-TS-style linear effect approximator
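
The steps listed above can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the reference implementation: the function names, the density threshold of 0.01, top-k selection by absolute magnitude, and the exact form of the approximator step (a normalized projection of the target minus a normalized projection of the approximator bias) are all assumptions made for the example. Plain Python lists are used so that it runs without dependencies; a real implementation would presumably use an array library and a trained SAE.

```python
from math import sqrt

def mean_vec(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale v to unit L2 norm (returned unchanged if the norm is zero)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def fgaa_target(pos_feats, neg_feats, density, bos_ids, max_density=0.01, k=1):
    """Build a sparse target vector in SAE feature space (hypothetical helper)."""
    # 1. Contrastive difference of mean SAE activations (positive minus negative).
    diff = [p - n for p, n in zip(mean_vec(pos_feats), mean_vec(neg_feats))]
    # 2. Density filtering: zero features that fire too often (assumed threshold).
    diff = [0.0 if density[i] > max_density else d for i, d in enumerate(diff)]
    # 3. BOS removal: zero features known to respond to the BOS token.
    diff = [0.0 if i in bos_ids else d for i, d in enumerate(diff)]
    # 4. Top-k selection: keep the k largest entries by magnitude (an assumption).
    order = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(diff)]

def fgaa_steering_vector(target, M, b):
    """Map the target through a linear effect approximator (assumed form).

    M has one row per residual-stream dimension and one column per SAE
    feature; b is the approximator bias in feature space.
    """
    return [t - c for t, c in zip(unit(matvec(M, target)), unit(matvec(M, b)))]

# Toy data: 4 SAE features, 2 residual-stream dimensions.
pos_feats = [[0.0, 2.0, 1.0, 5.0], [0.0, 4.0, 1.0, 5.0]]
neg_feats = [[1.0, 1.0, 0.5, 1.0], [1.0, 1.0, 0.5, 1.0]]
density = [0.001, 0.002, 0.003, 0.5]   # feature 3 is dense and gets dropped
bos_ids = {2}                          # feature 2 is treated as a BOS feature
M = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
b = [1.0, 0.0, 0.0, 0.0]

target = fgaa_target(pos_feats, neg_feats, density, bos_ids, k=1)
steer = fgaa_steering_vector(target, M, b)
print(target)  # [0.0, 2.0, 0.0, 0.0]
print(steer)   # [-1.0, 1.0]
```

In this toy run, the dense feature and the BOS feature are zeroed before top-k selection, so only feature 1 survives into the target. The final vector lives in the model's residual-stream space and would be added to activations at the chosen layer.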