SAE Clamping

Creator: Seonglae Cho
Created: 2025 Jan 11 22:13
Edited: 2025 Feb 9 15:55
When targeted SAE features activate, the intervention modifies the model's output by either clamping the feature activation to a fixed negative value or scaling it by a negative factor. Simply setting the features to 0 (ablation) is ineffective; pushing them to negative values was necessary for the knowledge-removal effect to appear. Moreover, when the same questions were asked in different orders, the targeted knowledge was only removed in some cases, so the effect is not robust to prompt ordering.
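A minimal sketch of the two interventions, assuming a PyTorch-style SAE with `encode`/`decode` methods hooked into the model's residual stream; `sae`, `feature_ids`, and the specific clamp/scale values are hypothetical placeholders, not a particular library's API:

```python
import torch

def clamp_sae_features(resid, sae, feature_ids, clamp_value=-5.0):
    # `sae.encode` is assumed to map the residual stream to feature
    # activations of shape [batch, seq, n_features].
    acts = sae.encode(resid)
    sub = acts[..., feature_ids]
    # Wherever a targeted feature fires, pin it to a fixed negative value;
    # plain ablation (setting it to 0) reportedly does not remove the knowledge.
    acts[..., feature_ids] = torch.where(
        sub > 0, torch.full_like(sub, clamp_value), sub
    )
    # Reconstruct the residual stream from the modified activations.
    return sae.decode(acts)

def negative_scale_sae_features(resid, sae, feature_ids, scale=-2.0):
    acts = sae.encode(resid)
    # Alternative: multiply the firing activation by a negative factor
    # instead of pinning it to a constant.
    acts[..., feature_ids] = scale * acts[..., feature_ids].clamp(min=0.0)
    return sae.decode(acts)
```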

Refusal

Unlearning

Interventions aimed at removing specific knowledge also degraded performance in domains unrelated to biology, and the model's loss increased on general text such as OpenWebText. Compared to negative scaling, clamping produced fewer of these side effects while being more effective at removal.
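A hedged sketch of that side-effect check: compare the model's next-token cross-entropy on unrelated text (e.g. OpenWebText samples) with and without the intervention. `model` and `tokenizer` follow Hugging Face conventions here, and `clamping_hook` is a hypothetical context manager that installs the clamp from the sketch above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_loss(model, tokenizer, texts, device="cpu"):
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        logits = model(ids).logits
        # Next-token cross-entropy over the sequence.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            ids[:, 1:].reshape(-1),
        )
        losses.append(loss.item())
    return sum(losses) / len(losses)

# baseline = mean_loss(model, tokenizer, openwebtext_samples)
# with clamping_hook(model, sae, feature_ids):  # hypothetical hook installer
#     clamped = mean_loss(model, tokenizer, openwebtext_samples)
# The gap `clamped - baseline` quantifies the collateral damage; the note
# reports that clamping increases it less than negative scaling does.
```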

Recommendations