SAE Unlearning

Creator
Seonglae Cho
Created
2025 Jan 11 22:13
Edited
2025 Feb 27 14:52
SAE Unlearning Methods

Limitation

Using SAEs to unlearn concepts has not proven very helpful
Feature interventions aimed at removing specific knowledge (biology-related, in this case) also degraded performance in domains unrelated to biology, and the language-modeling loss itself increased on general text such as OpenWebText. Of the two intervention types, clamping had fewer side effects than negative scaling and was more effective.
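The difference between the two interventions can be illustrated with a minimal NumPy sketch. It assumes the common definitions: negative scaling multiplies the selected feature activations by a negative factor, while clamping sets them to a fixed negative value only where the feature is active. The function names, the default value of -5.0, and the toy activations are hypothetical, chosen for illustration rather than taken from any specific implementation.

```python
import numpy as np

def negative_scale(acts, feature_ids, scale=-5.0):
    """Multiply the selected SAE feature activations by a negative factor.

    Assumed definition: the size of the edit grows with the activation.
    """
    out = acts.copy()
    out[..., feature_ids] *= scale
    return out

def clamp(acts, feature_ids, value=-5.0):
    """Set the selected features to a fixed negative value where they fire.

    Assumed definition: only positions with activation > 0 are changed,
    and the edit has the same size regardless of the activation.
    """
    out = acts.copy()
    selected = out[..., feature_ids]
    out[..., feature_ids] = np.where(selected > 0, value, selected)
    return out

# Toy activations: 2 tokens x 3 SAE features (hypothetical values).
acts = np.array([[0.0, 2.0, 0.5],
                 [0.0, 0.1, 3.0]])

scaled = negative_scale(acts, [1])   # feature 1 becomes -10.0 and -0.5
clamped = clamp(acts, [1])           # feature 1 becomes -5.0 on both tokens
```

In this sketch the scaled edit ranges from -0.5 to -10.0 depending on how strongly the feature fired, while the clamped edit is always -5.0. A bounded, activation-independent edit may be one reason clamping disturbs unrelated text less, though that is an interpretation and not a result stated here.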

Recommendations