SAE
Decode with Sparse Representation
A sparse autoencoder is a weak dictionary learning algorithm that generates learned features from a trained model, offering a more monosemantic unit of analysis than the model's neurons themselves.
The sparse autoencoder uses an L1 penalty to push most feature activations to zero (sparsity), keeping only a few features active at a time.
Using the SAE structure to decompose neuron activations in transformer models
It applies an L2 reconstruction loss plus an L1 penalty on the hidden activations. Focusing on the MLP blocks of transformer models, it takes the MLP layer activations as both the input and the reconstruction target during training. This emphasizes information the model itself treats as important, yielding interpretable features. Some papers, such as the JumpReLU SAE, instead use an L0 penalty to raise sparsity directly while avoiding the shrinkage side effects of L1.
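One common form of this objective (notation and details vary across the papers listed below; λ is the sparsity coefficient):

$$f = \mathrm{ReLU}(W_e x + b_e), \qquad \hat{x} = W_d f + b_d, \qquad \mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f \rVert_1$$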
Earlier work found that the sparse architectural approach (approach 1) was insufficient to prevent polysemanticity, while standard dictionary learning methods (approach 2) suffered significantly from overfitting.
Activation vector → Dictionary vector → Reconstructed vector
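A minimal sketch of this pipeline, assuming PyTorch; the class name, dictionary width, and l1_coeff below are illustrative choices, not taken from any specific paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict is usually several times d_model
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        # Activation vector -> sparse dictionary vector (ReLU keeps few features active)
        f = F.relu(self.encoder(x))
        # Dictionary vector -> reconstructed vector
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # L2 reconstruction loss + L1 sparsity penalty on the hidden features
    recon = F.mse_loss(x_hat, x)
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Usage: train on MLP-layer activations collected from the transformer
acts = torch.randn(4096, 768)                     # illustrative activation batch
sae = SparseAutoencoder(d_model=768, d_dict=8 * 768)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
```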
2015 sparse autoencoder
Sparse Overcomplete Word Vector Representations
Current distributed representations of words show little resemblance to theories of lexical semantics. The former are dense and uninterpretable, the latter largely based on familiar, discrete...
https://arxiv.org/abs/1506.02004

2017 diverse use cases
Alzheimer's detection from MRI
www.ijml.org
https://www.ijml.org/vol7/612-IP018.pdf
Sparse Autoencoder for Unsupervised Nucleus Detection and...
Histopathology images are crucial to the study of complex diseases such as cancer. The histologic characteristics of nuclei play a key role in disease diagnosis, prognosis and analysis. In this...
https://arxiv.org/abs/1704.00406

EndNet: Sparse AutoEncoder Network for Endmember Extraction and...
Data acquired from multi-channel sensors is a highly valuable asset to interpret the environment for a variety of remote sensing applications. However, low spatial resolution is a critical...
https://arxiv.org/abs/1708.01894

Image Classification Based on Convolutional Denoising Sparse Autoencoder
Image classification aims to group images into corresponding semantic categories. Due to the difficulties of interclass similarity and intraclass variability, it is a challenging issue in computer vi...
https://onlinelibrary.wiley.com/doi/10.1155/2017/5218247

2018 interpretability
SPINE: SParse Interpretable Neural Embeddings
Prediction without justification has limited utility. Much of the success of neural models can be attributed to their ability to learn rich, dense and expressive representations. While these...
https://arxiv.org/abs/1711.08792

arxiv.org
https://arxiv.org/pdf/1809.08621
2019
www.sciencedirect.com
https://www.sciencedirect.com/science/article/abs/pii/S0925231219312275
2021
aclanthology.org
https://aclanthology.org/2021.blackboxnlp-1.12.pdf
2022 Applying SAEs to transformer models: residual stream activations, not input embeddings as before
Taking features out of superposition with sparse autoencoders
[Interim research report] Taking features out of superposition with sparse autoencoders — LessWrong
We're thankful for helpful comments from Trenton Bricken, Eric Winsor, Noa Nabeshima, and Sid Black. …
https://www.lesswrong.com/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition
2023 interpretability work
Sparse Autoencoders Find Highly Interpretable Features in Language Models
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts....
https://arxiv.org/abs/2309.08600

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
https://transformer-circuits.pub/2023/monosemantic-features
[AI] What is a Generative AI Model? (1): AutoEncoder, VAE, GAN
Learning goals: 1. Distinguish supervised, unsupervised, and semi-supervised learning. 2. Understand the differences between the two representative generative models (VAE and GAN). 3. Explain the range of GAN applications. 1. Categories of machine learning (ML) methods. 2) Implicit density: instead of defining the model explicitly, repeated sampling converges toward a target probability distribution (Markov chain). Classification of generative models: they share the property of approximating the distribution of the training data. The diagram Ian Goodfellow attached to "Tutorial on Generative Adversarial Networks (2017)" makes this easy to understand.
https://newstellar.tistory.com/25

Seonglae Cho