Gradient SAE

Creator: Seonglae Cho
Created: 2025 Jan 8 21:03
Edited: 2025 Jan 8 21:08

g-SAEs (Gradient-aware Sparse Autoencoders)

Existing SAEs are likely to overlook latents that strongly influence the model's output, because they are trained only on input activation values. g-SAEs instead aim to reflect the dual role of latents in both the model's representation and its behavior.
By selecting latents whose changes most affect the output loss, g-SAEs have fewer dead latents (SAE Dead Neuron) than conventional SAEs and use the latent space more efficiently.
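A minimal sketch of this idea, assuming a TopK-style SAE where active latents are ranked by an activation-times-gradient score (the gradient of the downstream model loss with respect to the reconstructed activation, pulled back through the decoder). The class name, the `grad_wrt_x` argument, and the exact scoring rule are illustrative assumptions, not the paper's verbatim recipe.

```python
import torch
import torch.nn as nn


class GradientAwareTopKSAE(nn.Module):
    """Illustrative TopK SAE: active latents are chosen by an
    activation-times-gradient score rather than activation alone."""

    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.k = k

    def forward(self, x: torch.Tensor, grad_wrt_x: torch.Tensor):
        # Latent activations from the input, as in a standard SAE encoder.
        z = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Pull the downstream-loss gradient back through the decoder so each
        # latent gets a score reflecting its effect on the model's output:
        # score_i ≈ z_i * (∂L/∂x · W_dec_i)   (assumed form of the score)
        grad_per_latent = grad_wrt_x @ self.W_dec.T          # (batch, d_sae)
        score = z * grad_per_latent
        # Keep only the k latents with the largest |score|; zero the rest.
        topk = torch.topk(score.abs(), self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        z_sparse = z * mask
        x_hat = z_sparse @ self.W_dec + self.b_dec
        return x_hat, z_sparse


# Toy usage: grad_wrt_x would come from backpropagating the language-model
# loss to the activations being reconstructed; here it is a placeholder.
if __name__ == "__main__":
    sae = GradientAwareTopKSAE(d_model=64, d_sae=512, k=16)
    x = torch.randn(8, 64)
    grad_wrt_x = torch.randn(8, 64)
    x_hat, z = sae(x, grad_wrt_x)
    print(x_hat.shape, (z != 0).sum(dim=-1))  # ≤ k active latents per sample
```

Because selection depends on the output-loss gradient rather than activation magnitude alone, latents that matter for behavior keep receiving gradient updates, which is consistent with the claim of fewer dead latents.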
