SAE
Decode with Sparse Representation
A sparse autoencoder is a weak dictionary learning algorithm that generates learned features from a trained model, offering a more monosemantic unit of analysis than the model's neurons themselves.
The sparse autoencoder uses an L1 penalty to push most feature activations to zero (sparsity), keeping only a few features active at a time.
Using the SAE structure to decompose neuron activations in transformer models
It applies an L2 reconstruction loss plus an L1 penalty on the hidden activations. Focusing on the MLP blocks of transformer models, it takes the MLP layer activations as both the input and the reconstruction target during training. This emphasizes information the model itself treats as important, yielding interpretable features. Some papers, such as the JumpReLU SAE, instead use an L0 penalty to raise sparsity directly while avoiding the shrinkage side effects of L1.
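One common form of this objective (notation and details vary across the papers listed below; λ is the sparsity coefficient):

$$f = \mathrm{ReLU}(W_e x + b_e), \qquad \hat{x} = W_d f + b_d, \qquad \mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f \rVert_1$$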
Earlier work found that the sparse architectural approach (approach 1) was insufficient to prevent polysemanticity, while standard dictionary learning methods (approach 2) suffered significantly from overfitting.
Activation vector → Dictionary vector → Reconstructed vector
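A minimal sketch of this pipeline, assuming PyTorch; the class name, dictionary width, and l1_coeff below are illustrative choices, not taken from any specific paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict is usually several times d_model
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        # Activation vector -> sparse dictionary vector (ReLU keeps few features active)
        f = F.relu(self.encoder(x))
        # Dictionary vector -> reconstructed vector
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # L2 reconstruction loss + L1 sparsity penalty on the hidden features
    recon = F.mse_loss(x_hat, x)
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Usage: train on MLP-layer activations collected from the transformer
acts = torch.randn(4096, 768)                     # illustrative activation batch
sae = SparseAutoencoder(d_model=768, d_dict=8 * 768)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
```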
2015 sparse autoencoder
Sparse Overcomplete Word Vector Representations
Current distributed representations of words show little resemblance to theories of lexical semantics. The former are dense and uninterpretable, the latter largely based on familiar, discrete...
https://arxiv.org/abs/1506.02004

2017 diverse use cases
Alzheimer's detection from MRI
www.ijml.org
https://www.ijml.org/vol7/612-IP018.pdf
Sparse Autoencoder for Unsupervised Nucleus Detection and...
Histopathology images are crucial to the study of complex diseases such as cancer. The histologic characteristics of nuclei play a key role in disease diagnosis, prognosis and analysis. In this...
https://arxiv.org/abs/1704.00406

EndNet: Sparse AutoEncoder Network for Endmember Extraction and...
Data acquired from multi-channel sensors is a highly valuable asset to interpret the environment for a variety of remote sensing applications. However, low spatial resolution is a critical...
https://arxiv.org/abs/1708.01894

Image Classification Based on Convolutional Denoising Sparse Autoencoder
Image classification aims to group images into corresponding semantic categories. Due to the difficulties of interclass similarity and intraclass variability, it is a challenging issue in computer vi...
https://onlinelibrary.wiley.com/doi/10.1155/2017/5218247

2018 interpretability
SPINE: SParse Interpretable Neural Embeddings
Prediction without justification has limited utility. Much of the success of neural models can be attributed to their ability to learn rich, dense and expressive representations. While these...
https://arxiv.org/abs/1711.08792

arxiv.org
https://arxiv.org/pdf/1809.08621
2019
www.sciencedirect.com
https://www.sciencedirect.com/science/article/abs/pii/S0925231219312275
2021
aclanthology.org
https://aclanthology.org/2021.blackboxnlp-1.12.pdf
2022 Applying SAEs to transformer models: residual stream activations, not input embeddings as before
Taking features out of superposition with sparse autoencoders
[Interim research report] Taking features out of superposition with sparse autoencoders — LessWrong
We're thankful for helpful comments from Trenton Bricken, Eric Winsor, Noa Nabeshima, and Sid Black. …
https://www.lesswrong.com/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition
2023 interpretability work
Sparse Autoencoders Find Highly Interpretable Features in Language Models
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts....
https://arxiv.org/abs/2309.08600

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
https://transformer-circuits.pub/2023/monosemantic-features
[AI] What is a Generative AI Model? (1): AutoEncoder, VAE, GAN
Learning goals: 1. Distinguish supervised, unsupervised, and semi-supervised learning. 2. Understand the differences between the two representative generative models (VAE and GAN). 3. Explain the range of GAN applications. 1. Categories of machine learning (ML) methods. 2) Implicit density: instead of defining the model explicitly, repeated sampling converges toward a target probability distribution (Markov chain). Classification of generative models: they share the property of approximating the distribution of the training data. The diagram Ian Goodfellow attached to "Tutorial on Generative Adversarial Networks (2017)" makes this easy to understand.
https://newstellar.tistory.com/25

Seonglae Cho