SAE Structure

Creator: Seonglae Cho
Created: 2021 Nov 30 5:27
Edited: 2026 Feb 19 0:48

SAE

Decode with Sparse Representation

A sparse autoencoder is a weak dictionary learning algorithm that generates learned features from a trained model, offering a more monosemantic unit of analysis than the model's neurons themselves.
The Sparse AutoEncoder uses L1 Loss to force most feature activations to zero (sparsity) while keeping only a few active.
Using the SAE Structure to decompose neuron activations in transformer models, it applies an L2 reconstruction loss plus an L1 penalty on the hidden activations. Focusing on the MLP sections of transformer models, it trains with MLP layer activations as both input and reconstruction target, emphasizing information the model itself treats as important and providing interpretable insights. Some papers use an L0 penalty instead to raise sparsity directly without the shrinkage side effect of L1, as in JumpReLU SAE.
The sparse architectural approach (approach 1) was insufficient to prevent polysemanticity, and standard dictionary learning methods (approach 2) had significant overfitting issues.
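Written out, the objective described above takes the standard form (a minimal sketch; the notation W_enc, W_dec, λ is the usual convention assumed here, not taken from this page):

```latex
% Encoder: sparse feature activations (dictionary coefficients)
f(x) = \mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}})
% Decoder: reconstruction from the learned dictionary
\hat{x} = W_{\mathrm{dec}} f(x) + b_{\mathrm{dec}}
% L2 reconstruction loss + L1 sparsity penalty with coefficient lambda
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f(x) \rVert_1
```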

Activation vector → Dictionary vector → Reconstructed vector

Reconstruction

Loss
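A minimal PyTorch sketch of this activation → dictionary → reconstruction pipeline (the dimensions, 8× expansion factor, and l1_coeff value are illustrative assumptions, not values from a specific paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Activation vector -> sparse dictionary vector -> reconstructed vector."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_dict)  # encoder into overcomplete dictionary
        self.W_dec = nn.Linear(d_dict, d_model)  # decoder back to activation space

    def forward(self, x: torch.Tensor):
        f = F.relu(self.W_enc(x))   # sparse feature activations
        x_hat = self.W_dec(f)       # reconstruction from dictionary features
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = F.mse_loss(x_hat, x)        # L2 reconstruction term
    sparsity = f.abs().sum(-1).mean()   # L1 penalty on feature activations
    return recon + l1_coeff * sparsity

# Usage on cached MLP activations (random stand-in here), 8x expansion
sae = SparseAutoencoder(d_model=512, d_dict=4096)
x = torch.randn(64, 512)
x_hat, f = sae(x)
sae_loss(x, x_hat, f).backward()
```

The decoder weight columns act as the learned dictionary; normalizing or tying them to the encoder is a common variant.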

2015 sparse autoencoder

Sparse Overcomplete Word Vector Representations
Current distributed representations of words show little resemblance to theories of lexical semantics. The former are dense and uninterpretable, the latter largely based on familiar, discrete...
2017 diverse use case
Alzheimer's detection from MRI
www.ijml.org
Sparse Autoencoder for Unsupervised Nucleus Detection and...
Histopathology images are crucial to the study of complex diseases such as cancer. The histologic characteristics of nuclei play a key role in disease diagnosis, prognosis and analysis. In this...
EndNet: Sparse AutoEncoder Network for Endmember Extraction and...
Data acquired from multi-channel sensors is a highly valuable asset to interpret the environment for a variety of remote sensing applications. However, low spatial resolution is a critical...
Image Classification Based on Convolutional Denoising Sparse Autoencoder
Image classification aims to group images into corresponding semantic categories. Due to the difficulties of interclass similarity and intraclass variability, it is a challenging issue in computer vi...
2018 interpretability
SPINE: SParse Interpretable Neural Embeddings
Prediction without justification has limited utility. Much of the success of neural models can be attributed to their ability to learn rich, dense and expressive representations. While these...
2019
www.sciencedirect.com
2021
aclanthology.org
2022 Applying to Transformer Model Residual Stream, not input embeddings as before
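For example, residual-stream activations can be cached with a forward hook and used as SAE training data (a sketch assuming a Hugging Face GPT-2; the layer index and prompt are arbitrary illustrative choices):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

acts = []
def cache_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    acts.append(hidden.detach().reshape(-1, hidden.shape[-1]))

# GPT-2 blocks live in model.h; a block's output is the residual stream
# after that layer (layer 6 is an arbitrary choice for this sketch)
handle = model.h[6].register_forward_hook(cache_hook)
with torch.no_grad():
    model(**tok("Sparse autoencoders decompose activations.", return_tensors="pt"))
handle.remove()

sae_inputs = torch.cat(acts)  # one residual-stream vector per token position
```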
Taking features out of superposition with sparse autoencoders
[Interim research report] Taking features out of superposition with sparse autoencoders — LessWrong
We're thankful for helpful comments from Trenton Bricken, Eric Winsor, Noa Nabeshima, and Sid Black.  …

2023 interpretability work

Sparse Autoencoders Find Highly Interpretable Features in Language Models
One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts....
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
[AI] What is a Generative AI Model? (1): AutoEncoder, VAE, GAN
Learning objectives: 1. Distinguish supervised/unsupervised/semi-supervised learning. 2. Understand the differences between the two representative generative models, VAE and GAN. 3. Describe the range of GAN applications. Implicit density models converge to a target probability distribution through repeated sampling (Markov Chain) instead of defining the model explicitly. What generative models share is approximating the distribution of the training data, as illustrated in the diagram Ian Goodfellow attached to "Tutorial on Generative Adversarial Networks (2017)".
