SAE Unlearning Methods
Limitation
SAEs were not very helpful for unlearning concepts
Interventions aimed at removing specific knowledge degraded performance in domains unrelated to biology, and the loss also increased on general text such as OpenWebText. Compared to negative scaling, clamping had fewer side effects and was more effective.
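To make the two interventions concrete, here is a minimal NumPy sketch of how they might differ. It assumes that negative scaling multiplies a feature's activation by a negative factor, and that clamping sets the activation to a fixed negative value whenever the feature is active. The function names, the default values of -5.0, and the decoder matrix `W_dec` are hypothetical illustrations, not the setup used in these notes.

```python
import numpy as np

def scale_feature(acts, idx, scale=-5.0):
    """Negative scaling (assumed form): multiply feature `idx` by a negative factor."""
    out = acts.copy()
    out[..., idx] = out[..., idx] * scale
    return out

def clamp_feature(acts, idx, value=-5.0):
    """Clamping (assumed form): set feature `idx` to a fixed value wherever it is active."""
    out = acts.copy()
    active = out[..., idx] > 0
    out[..., idx] = np.where(active, value, out[..., idx])
    return out

def apply_intervention(resid, acts, new_acts, W_dec):
    """Add the decoded change in activations to the residual stream.

    Adding only the delta (hypothetical design choice) leaves the part of the
    residual that the SAE fails to reconstruct untouched.
    """
    return resid + (new_acts - acts) @ W_dec

# Toy example: 2 tokens, 2 SAE features; feature 1 is the one to suppress.
acts = np.array([[0.0, 2.0],
                 [0.0, 0.5]])
scaled = scale_feature(acts, idx=1)   # size of the edit depends on the activation
clamped = clamp_feature(acts, idx=1)  # same fixed value on every active token
```

Under these assumptions, scaling produces an edit whose size grows with the original activation, while clamping applies the same fixed value on every token where the feature fires. Both leave tokens where the feature is inactive unchanged.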