SAE Decoder Loss

SAE Feature Direction Loss

Feature absorption and feature co-occurrence can cause an SAE to learn "broken latents", in which several underlying features are mixed together. Tied SAEs, whose encoder and decoder share the same weights, learn cleaner representations, but the problem still arises when the SAE has too few latents for the features it must represent, for example features in a parent-child relationship.

To mitigate this mixing, an auxiliary loss is added: the squared cosine similarity between the inputs and the feature directions, applied where activations are low. It is meant to encourage single peaks in activation strength.
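
One possible reading of the auxiliary loss described above, as a minimal sketch rather than the post's implementation. The function names, the `threshold` that defines "low activation", and the choice to average over qualifying (input, latent) pairs are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def low_activation_cosine_penalty(inputs, directions, activations, threshold=0.1):
    """Mean squared cosine similarity over (input, latent) pairs whose
    activation is below `threshold` (a hypothetical cutoff).

    inputs:      list of input vectors
    directions:  list of feature directions, one per latent
    activations: activations[i][j] is latent j's activation on input i
    """
    total, count = 0.0, 0
    for x, acts in zip(inputs, activations):
        for d, a in zip(directions, acts):
            if a < threshold:  # only penalize weakly active latents
                total += cosine(x, d) ** 2
                count += 1
    return total / count if count else 0.0


# Toy data: two inputs, two latents with orthogonal directions.
inputs = [[1.0, 0.0], [1.0, 1.0]]
directions = [[1.0, 0.0], [0.0, 1.0]]
activations = [[1.0, 0.0], [0.05, 0.9]]
penalty = low_activation_cosine_penalty(inputs, directions, activations)
```

In this toy case the only nonzero contribution comes from the second input, which is partly aligned with the first latent's direction even though that latent barely fires on it. Under this reading, that is the kind of alignment the penalty pushes down.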

Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models — LessWrong

Thanks to Jean Kaddour, Tomáš Dulka, and Joseph Bloom for providing feedback on earlier drafts of this post. …

https://www.lesswrong.com/posts/XHpta8X85TzugNNn2/broken-latents-studying-saes-and-feature-co-occurrence-in

Tanh loss

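A tanh-based sparsity penalty is reported to achieve Pareto-optimality by "minimizing feature activations while maintaining low output error", that is, by trading sparsity off against reconstruction error.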
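
To make the trade-off concrete, here is a minimal sketch of a reconstruction loss with a tanh sparsity penalty. The exact form, the scale `c`, and the weight `lam` are assumptions for illustration; the note above does not specify them, and the linked paper's formulation may differ.

```python
import math


def tanh_sparsity_penalty(activations, c=1.0):
    """Sum of tanh(c * a) over latent activations (assumed non-negative).

    `c` is a hypothetical scale controlling how quickly the penalty saturates.
    """
    return sum(math.tanh(c * a) for a in activations)


def sae_loss(x, x_hat, activations, lam=0.1, c=1.0):
    """Mean squared reconstruction error plus a weighted tanh sparsity term.

    `lam` is a hypothetical weight balancing output error against sparsity.
    """
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + lam * tanh_sparsity_penalty(activations, c)


# Perfect reconstruction with no active latents costs nothing.
zero_loss = sae_loss([1.0, 2.0], [1.0, 2.0], [0.0])

# One strongly active latent adds at most `lam`, because tanh saturates at 1.
example_loss = sae_loss([1.0, 2.0], [1.0, 0.0], [0.0, 100.0], lam=0.1)
```

Because tanh is bounded by 1, each latent's contribution to the penalty is capped. In this sketch the term therefore behaves more like a soft count of active latents than a penalty on activation magnitude.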

Circuit Tracing: Revealing Computational Graphs in Language Models

We describe an approach to tracing the “step-by-step” computation involved when a model responds to a single prompt.

https://transformer-circuits.pub/2025/attribution-graphs/methods.html
