SAE Clamping

When specific SAE features are activated, the model's output is modified by either clamping those features to a fixed negative value or scaling their activation values by a negative factor. Simply setting the features to 0 (ablation) was ineffective; knowledge-removal effects emerged only when the activations were pushed to negative values. The effect was also sensitive to ordering: when the same questions were tested in different orders, the targeted knowledge was effectively removed in only some cases.
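The three interventions described above can be compared on a toy example. This is a minimal sketch, not the implementation from the cited work: the function names, the clamp value of -5.0, the scale factor of -2.0, and the toy decoder are all assumptions chosen for illustration, and the "only when the feature is active" condition is taken as activation greater than 0.

```python
from typing import List, Sequence, Set


def ablate(acts: Sequence[float], targets: Set[int]) -> List[float]:
    """Set the targeted features to 0 (the variant the notes report as ineffective)."""
    return [0.0 if i in targets else a for i, a in enumerate(acts)]


def clamp(acts: Sequence[float], targets: Set[int], value: float = -5.0) -> List[float]:
    """Set targeted features to a fixed negative value, but only where they are active."""
    return [value if (i in targets and a > 0.0) else a for i, a in enumerate(acts)]


def negative_scale(acts: Sequence[float], targets: Set[int], factor: float = -2.0) -> List[float]:
    """Multiply targeted features by a negative factor, but only where they are active."""
    return [factor * a if (i in targets and a > 0.0) else a for i, a in enumerate(acts)]


def decode(acts: Sequence[float], decoder: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Linear SAE decoder: bias + sum_i acts[i] * decoder[i]."""
    out = list(bias)
    for a, direction in zip(acts, decoder):
        for j, d in enumerate(direction):
            out[j] += a * d
    return out


# Toy setup (hypothetical): 3 SAE features, 2-dimensional model activation.
acts = [0.0, 3.0, 1.5]          # feature 0 is inactive, features 1 and 2 are active
targets = {0, 1}                # features assumed to encode the targeted knowledge
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0]

ablated = ablate(acts, targets)          # [0.0, 0.0, 1.5]
clamped = clamp(acts, targets)           # [0.0, -5.0, 1.5]
scaled = negative_scale(acts, targets)   # [0.0, -6.0, 1.5]

# Ablation only removes the feature's contribution; clamping pushes the
# reconstruction in the opposite direction along the feature's decoder vector.
recon_ablated = decode(ablated, decoder, bias)   # [1.5, 1.5]
recon_clamped = decode(clamped, decoder, bias)   # [1.5, -3.5]
```

Note that the inactive target (feature 0) is left untouched by both clamping and scaling, while ablation has nothing to remove there either. The difference between the methods shows up only on the active target.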

Refusal

Steering Language Model Refusal with Sparse Autoencoders

Responsible practices for deploying language models include guiding models to recognize and refuse answering prompts that are considered unsafe, while complying with safe prompts. Achieving such behavior typically requires updating model weights, which is costly and inflexible. We explore opportunities to steer model activations at inference time, which does not require updating weights. Using sparse autoencoders, we identify and steer features in Phi-3 Mini that mediate refusal behavior. We find that feature steering can improve Phi-3 Mini’s robustness to jailbreak attempts across various harms, including challenging multi-turn attacks. However, we discover that feature steering can adversely affect overall performance on benchmarks. These results suggest that identifying steerable mechanisms for refusal via sparse autoencoders is a promising approach for enhancing language model safety, but that more research is needed to mitigate feature steering’s adverse effects on performance.

https://arxiv.org/html/2411.11296v1

Unlearning

Interventions aimed at removing specific knowledge degraded performance in domains unrelated to biology, and the loss also increased on general text such as OpenWebText. Compared with negative scaling, clamping had fewer side effects and was more effective.

https://arxiv.org/pdf/2410.19278
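
The side effect noted above, a higher loss on unrelated text, can be expressed as the change in mean negative log-likelihood with and without the intervention. The sketch below is only an illustration of that arithmetic: the function names and the per-token probabilities are hypothetical and are not results from the cited paper.

```python
import math
from typing import Sequence


def mean_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood (in nats) of the correct next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def loss_increase(baseline_probs: Sequence[float], intervened_probs: Sequence[float]) -> float:
    """Loss with the intervention minus loss without it; positive means degradation."""
    return mean_nll(intervened_probs) - mean_nll(baseline_probs)


# Hypothetical probabilities the model assigns to the correct next token
# on a short piece of unrelated text, before and after the intervention.
baseline = [0.5, 0.25, 0.5]
intervened = [0.25, 0.25, 0.25]

delta = loss_increase(baseline, intervened)  # equals 2 * ln(2) / 3 for these values
```

A delta close to zero on general text would indicate that the intervention is well targeted; a clearly positive delta corresponds to the kind of degradation described in this section.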
