SAE Feature Direction Loss

Creator
Seonglae Cho
Created
2025 Mar 13 17:23
Edited
2025 Mar 13 17:23

The SAE Feature Absorption and co-occurrence problems cause an SAE to learn "broken latents". Tied SAEs, whose encoder and decoder share the same weights, have cleaner representations, but issues still arise when there are too few latents to cover related concepts such as parent-child relationships.
To mitigate this mixing, an auxiliary loss is introduced: the squared cosine similarity between inputs and feature directions, applied at low activation states. It is meant to encourage single peaks in activation strength.
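A minimal sketch of such an auxiliary loss in plain Python. The source does not specify how "low activation" is defined or how the penalty is aggregated. The hard threshold `low_threshold`, the averaging over masked pairs, and the function names here are all assumptions for illustration. In practice this would be vectorized in a tensor library.

```python
import math


def cosine_similarity(x, d):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(x, d))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_x == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_x * norm_d)


def direction_aux_loss(inputs, directions, activations, low_threshold=0.1):
    """Mean squared cosine similarity over (input, latent) pairs with low activation.

    inputs:      list of input vectors, one per sample
    directions:  list of feature directions, one per latent
    activations: activations[i][j] is latent j's activation on sample i
    low_threshold: hypothetical cutoff defining a "low activation state"
    """
    total = 0.0
    count = 0
    for x, acts in zip(inputs, activations):
        for d, a in zip(directions, acts):
            # Assumption: only penalize alignment where the latent is (nearly) off.
            if a < low_threshold:
                total += cosine_similarity(x, d) ** 2
                count += 1
    return total / count if count else 0.0


# One input aligned with latent 0 and orthogonal to latent 1, both latents inactive.
loss = direction_aux_loss(
    inputs=[[1.0, 0.0]],
    directions=[[1.0, 0.0], [0.0, 1.0]],
    activations=[[0.0, 0.0]],
)
print(loss)  # 0.5
```

In this toy case the penalty comes entirely from latent 0, which points along the input but is not firing. Once that latent is strongly active it is excluded by the mask, so only alignment at low activation is penalized.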
 
 
 
