SAE
Decode with Sparse Representation
A sparse autoencoder is a weak dictionary learning algorithm that generates learned features from a trained model, offering a more monosemantic unit of analysis than the model's neurons themselves.
The sparse autoencoder uses an L1 penalty to push most feature activations to zero (sparsity), so that only a few features are active for any given input.
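A minimal sketch of such an autoencoder in PyTorch, for illustration only: the class name SparseAutoencoder and the sizes d_model and d_dict are assumptions, not taken from any particular paper's released code.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict is usually several times d_model.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        # Feature (dictionary) activations: ReLU keeps them non-negative,
        # and the L1 penalty during training keeps most of them at zero.
        f = torch.relu(self.encoder(x))
        # Reconstruction of the original activation vector.
        x_hat = self.decoder(f)
        return x_hat, f
```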
Using a sparse autoencoder to decompose neuron activations in transformer models
It applies an L2 reconstruction loss plus an L1 penalty to the hidden (feature) activations. Training focuses on the MLP layers of transformer models, using MLP activations as both the input and the reconstruction target. This emphasizes information the model itself treats as important and yields interpretable features. Some papers, such as JumpReLU SAE, instead use an (approximate) L0 penalty to encourage sparsity more directly, avoiding side effects of the L1 penalty such as activation shrinkage.
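Continuing the sketch above, one way this training objective could look in code (assumes the SparseAutoencoder class from the previous sketch; mlp_acts, l1_coeff, and the dimensions are illustrative placeholders, not values from the literature):

```python
import torch

d_model, d_dict, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_dict)      # from the sketch above
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for a batch of MLP activations collected from the transformer.
mlp_acts = torch.randn(1024, d_model)

x_hat, f = sae(mlp_acts)
recon_loss = (x_hat - mlp_acts).pow(2).sum(dim=-1).mean()  # L2 reconstruction
sparsity_loss = f.abs().sum(dim=-1).mean()                 # L1 penalty on features
loss = recon_loss + l1_coeff * sparsity_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```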
Earlier experiments found that the sparse architectural approach (approach 1) was insufficient to prevent polysemanticity, and that standard dictionary learning methods (approach 2) had significant issues with overfitting.
Activation vector → Dictionary vector → Reconstructed vector
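In symbols, a common formulation of this flow (some implementations also subtract the decoder bias from the input before encoding): x is the activation vector, f the sparse dictionary vector, and x̂ the reconstruction, trained with the L2 + L1 objective above.

```latex
f = \mathrm{ReLU}(W_{\mathrm{enc}}\,x + b_{\mathrm{enc}}), \qquad
\hat{x} = W_{\mathrm{dec}}\,f + b_{\mathrm{dec}}, \qquad
\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f \rVert_1
```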
2015: Sparse autoencoder
2022: Taking features out of superposition with sparse autoencoders