SAE
Decode with Sparse Representation
A sparse autoencoder is a weak dictionary learning algorithm that generates learned features from a trained model, offering a more monosemantic unit of analysis than the model's neurons themselves.
The sparse autoencoder applies an L1 penalty that pushes most feature activations to zero (sparsity), so only a few features remain active for any given input.
Using a sparse autoencoder to decompose neuron activations in transformer models
Training minimizes an L2 reconstruction loss plus an L1 penalty on the hidden activations. SAEs typically target the MLP layers of transformer models, using the MLP activations as both the input and the reconstruction target. This focuses the dictionary on information the model actually represents, yielding interpretable features. Some papers, such as the JumpReLU SAE, instead use an (approximate) L0 penalty to raise sparsity directly, avoiding side effects of L1 such as activation shrinkage.
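Written out, the standard objective looks like the following (conventional notation rather than notation from any single paper; λ trades off reconstruction quality against sparsity):

```latex
f(x) = \mathrm{ReLU}(W_e x + b_e)          % encoder: activation -> sparse features
\hat{x} = W_d f(x) + b_d                   % decoder: dictionary reconstruction
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f(x) \rVert_1
```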
The sparse architectural approach (approach 1) was insufficient to prevent polysemanticity, and standard dictionary learning methods (approach 2) had significant overfitting issues.
Activation vector → Dictionary vector → Reconstructed vector
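Below is a minimal PyTorch sketch of this pipeline, assuming it trains on MLP activations as described above. The sizes (`d_model=512`, `d_dict=4096`) and the `l1_coeff` value are illustrative placeholders, not values from any cited paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Maps activation vectors to an overcomplete, sparse dictionary code."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> dictionary code
        self.decoder = nn.Linear(d_dict, d_model)  # dictionary code -> reconstruction

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)      # reconstructed activation vector
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = F.mse_loss(x_hat, x)           # L2 reconstruction loss
    sparsity = f.abs().sum(dim=-1).mean()  # L1 penalty on hidden activations
    return recon + l1_coeff * sparsity

# Train on MLP activations collected from the transformer being analyzed.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)  # stand-in for a batch of real MLP activations
opt.zero_grad()
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
opt.step()
```

Making the dictionary overcomplete (`d_dict` several times larger than `d_model`) is what lets features that are superposed in the activation space separate into distinct dictionary directions.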
2015 sparse overcomplete word vectors
Sparse Overcomplete Word Vector Representations
Current distributed representations of words show little resemblance to theories of lexical semantics. The former are dense and uninterpretable, the latter largely based on familiar, discrete...
https://arxiv.org/abs/1506.02004

2022 Taking features out of superposition with sparse autoencoders
[Interim research report] Taking features out of superposition with sparse autoencoders — LessWrong
We're thankful for helpful comments from Trenton Bricken, Eric Winsor, Noa Nabeshima, and Sid Black. …
https://www.lesswrong.com/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition
2023 interpretability work
Sparse Autoencoders Find Highly Interpretable Features in Language Models
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts...
https://arxiv.org/abs/2309.08600

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
https://transformer-circuits.pub/2023/monosemantic-features
[AI] What is a Generative AI Model? (1): AutoEncoder, VAE, GAN
Learning objectives: 1. Distinguish the characteristics of supervised/unsupervised/semi-supervised learning. 2. Understand the differences between the two representative generative models (VAE and GAN). 3. Describe the range of applications of GANs. 1. Categories of machine learning (ML) methods 2) Implicit density: instead of defining the model explicitly, sampling is repeated so that it converges to a particular probability distribution (Markov Chain). What the classes of generative models have in common is that they approximate the distribution of the training data. The diagram Ian Goodfellow included in "Tutorial on Generative Adversarial Networks (2017)" makes this easy to understand.
https://newstellar.tistory.com/25

Seonglae Cho