SAE
Decode with Sparse Representation
A sparse autoencoder is a weak dictionary learning algorithm that generates learned features from a trained model, offering a more monosemantic unit of analysis than the model's neurons themselves.
The sparse autoencoder uses an L1 penalty to push most feature activations to zero (sparsity), so that only a few features are active for any given input.
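A minimal sketch of such an autoencoder in PyTorch, for illustration only: the class name SparseAutoencoder and the sizes d_model and d_dict are assumptions, not taken from any particular paper's released code.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict is usually several times d_model.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        # Feature (dictionary) activations: ReLU keeps them non-negative,
        # and the L1 penalty during training keeps most of them at zero.
        f = torch.relu(self.encoder(x))
        # Reconstruction of the original activation vector.
        x_hat = self.decoder(f)
        return x_hat, f
```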
Using a sparse autoencoder to decompose neuron activations in transformer models
It applies an L2 reconstruction loss plus an L1 penalty to the hidden (feature) activations. Training focuses on the MLP layers of transformer models, using MLP activations as both the input and the reconstruction target. This emphasizes information the model itself treats as important and yields interpretable features. Some papers, such as JumpReLU SAE, instead use an (approximate) L0 penalty to encourage sparsity more directly, avoiding side effects of the L1 penalty such as activation shrinkage.
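Continuing the sketch above, one way this training objective could look in code (assumes the SparseAutoencoder class from the previous sketch; mlp_acts, l1_coeff, and the dimensions are illustrative placeholders, not values from the literature):

```python
import torch

d_model, d_dict, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_dict)      # from the sketch above
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for a batch of MLP activations collected from the transformer.
mlp_acts = torch.randn(1024, d_model)

x_hat, f = sae(mlp_acts)
recon_loss = (x_hat - mlp_acts).pow(2).sum(dim=-1).mean()  # L2 reconstruction
sparsity_loss = f.abs().sum(dim=-1).mean()                 # L1 penalty on features
loss = recon_loss + l1_coeff * sparsity_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```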
Earlier experiments found that the sparse architectural approach (approach 1) was insufficient to prevent polysemanticity, and that standard dictionary learning methods (approach 2) had significant issues with overfitting.
Activation vector → Dictionary vector → Reconstructed vector
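In symbols, a common formulation of this flow (some implementations also subtract the decoder bias from the input before encoding): x is the activation vector, f the sparse dictionary vector, and x̂ the reconstruction, trained with the L2 + L1 objective above.

```latex
f = \mathrm{ReLU}(W_{\mathrm{enc}}\,x + b_{\mathrm{enc}}), \qquad
\hat{x} = W_{\mathrm{dec}}\,f + b_{\mathrm{dec}}, \qquad
\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f \rVert_1
```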
2015: Sparse autoencoder
2022: Taking features out of superposition with sparse autoencoders