FGAA

Feature Guided Activation Additions

A technique that builds interpretable, precise steering vectors through density filtering, BOS feature removal, top-feature selection, and optimization against a linear effect approximator. The paper reports stronger behavioral control and more consistent outputs on most tasks than CAA, SAE decoder steering, and SAE-TS.

Contrastive difference in SAE space

Low-density filtering of SAE features

Removal of BOS features

Top-k feature selection

SAE-TS-like effect approximator
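
The steps above can be sketched in NumPy roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The names `fgaa_target`, `steering_from_target`, `density`, `bos_ids`, `max_density`, `M`, and `b` are hypothetical. The sketch assumes that density filtering keeps only features at or below a density threshold, that top-k selection ranks by absolute value, and that the effect approximator is a linear map `M @ v + b` from a steering vector to SAE feature effects. The least-squares solve is a stand-in for whatever optimization the paper actually uses.

```python
import numpy as np


def fgaa_target(pos_feats, neg_feats, density, bos_ids, k, max_density=0.01):
    """Build a sparse target vector in SAE feature space.

    pos_feats, neg_feats: (n_examples, n_features) SAE activations for
        examples that do / do not show the desired behavior.
    density: (n_features,) fraction of tokens on which each feature fires.
    bos_ids: indices of features treated as BOS features (hypothetical input).
    """
    # Step 1: contrastive difference of mean SAE activations.
    target = pos_feats.mean(axis=0) - neg_feats.mean(axis=0)

    # Step 2: density filtering. Assumption: features denser than
    # max_density are zeroed; the threshold value is illustrative.
    target = np.where(density > max_density, 0.0, target)

    # Step 3: zero out BOS features.
    target[np.asarray(bos_ids, dtype=int)] = 0.0

    # Step 4: keep the k entries with the largest magnitude (assumed ranking).
    keep = np.argsort(np.abs(target))[-k:]
    sparse = np.zeros_like(target)
    sparse[keep] = target[keep]
    return sparse


def steering_from_target(target, M, b):
    """Pick a unit-norm steering vector whose predicted effect matches target.

    Assumes a linear effect approximator: effect ~= M @ v + b, with
    M of shape (n_features, d_model) and b of shape (n_features,).
    """
    # Least-squares solve of M @ v + b ~= target (stand-in optimization).
    v, *_ = np.linalg.lstsq(M, target - b, rcond=None)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

In use, the output of `fgaa_target` would be passed to `steering_from_target`, and the resulting vector would be scaled and added to the model's hidden states at the chosen layer.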

Interpretable Steering of Large Language Models with Feature Guided...

Effective and reliable control over Large Language Model behavior is a significant challenge. While activation steering methods, which add steering vectors to a model's hidden states, are a...

https://openreview.net/forum?id=swRxS7s4rB
