Sparse Autoencoder

Creator: Seonglae Cho
Created: 2021 Nov 30 5:27
Edited: 2025 Mar 28 12:00

SAE

Decode with Sparse Representation

A sparse autoencoder is a weak dictionary learning algorithm that generates learned features from a trained model; these features offer a more monosemantic unit of analysis than the model's neurons themselves.
The Sparse AutoEncoder uses an L1 Loss to force most features to zero (sparsity) while keeping only a few active.
Using a Sparse Autoencoder to decompose neuron activations in transformer models
It applies an L2 reconstruction loss plus an L1 penalty to the hidden activation layer. Focusing on the MLP sections of transformer models, it uses MLP layer activations as both the input and the reconstruction target during training. This approach emphasizes information the model deems important, providing interpretable insights. Some papers, such as JumpReLU SAE, instead use an L0 penalty to raise sparsity directly without the side effects of the L1 penalty.
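A minimal training sketch, assuming PyTorch; the class name, dictionary size, and l1_coeff are illustrative placeholders, not values from any specific paper:

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict is typically several times d_model
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations (dictionary vector)
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # L2 reconstruction loss + L1 penalty that pushes most features toward zero
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


# Usage: mlp_acts stands in for activations cached from a transformer MLP layer
sae = SparseAutoencoder(d_model=768, d_dict=8 * 768)
mlp_acts = torch.randn(64, 768)
x_hat, f = sae(mlp_acts)
loss = sae_loss(mlp_acts, x_hat, f)
loss.backward()
```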
The sparse architectural approach (approach 1) was insufficient to prevent polysemanticity, and standard dictionary learning methods (approach 2) had significant issues with overfitting.

Activation vector → Dictionary vector → Reconstructed vector

Reconstruction
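As a sketch of the standard formulation (assuming a ReLU encoder; the weight and bias symbols below are generic notation, and details vary across papers), an activation vector x is encoded into dictionary features f and decoded back into a reconstruction x̂:

$$f = \mathrm{ReLU}(W_{\mathrm{enc}}\, x + b_{\mathrm{enc}}), \qquad \hat{x} = W_{\mathrm{dec}}\, f + b_{\mathrm{dec}}$$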

Loss
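Assuming the formulation above, the training objective combines the L2 reconstruction term with an L1 sparsity penalty weighted by a coefficient λ:

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f \rVert_1$$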

2015 sparse autoencoder

2022 Taking features out of superposition with sparse autoencoders

2023 interpretability work

