Neuron SAE history

Creator: Seonglae Cho
Created: 2024 Oct 24 9:44
Edited: 2025 Jan 26 19:40
Refs

2022 Dec

It turns out that an extremely simple method – training a single-layer autoencoder to reconstruct neural activations with an L1 penalty on hidden activations – doesn't just identify features that minimize the loss, but actually recovers the ground-truth features that generated the data. However, at least with this method of sparse coding, it is extremely costly to extract features from superposition (possibly more costly than training the models themselves):
  • The L1 penalty coefficient needs to be just right
  • We need more learned features than ground truth features
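The setup described above can be sketched in a few lines. This is a minimal NumPy illustration, not the report's implementation: the dimensions, initialization, and `l1_coeff` value are all arbitrary placeholders, and only the forward pass and loss are shown (no training loop).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: an overcomplete dictionary, i.e. more
# learned features (d_hidden) than input dimensions (d_model).
d_model, d_hidden, batch = 16, 64, 32

# Single-layer autoencoder parameters (untied encoder/decoder weights).
W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction MSE plus an L1 sparsity penalty on hidden activations."""
    h = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU hidden activations
    x_hat = h @ W_dec + b_dec               # reconstruction of the input
    recon = ((x - x_hat) ** 2).mean()       # reconstruction term
    sparsity = np.abs(h).mean()             # L1 term; coefficient must be tuned
    return recon + l1_coeff * sparsity, h

x = rng.normal(size=(batch, d_model))
loss, h = sae_loss(x)
```

The two bullet points above show up directly here: `l1_coeff` trades reconstruction against sparsity and must be tuned carefully, and `d_hidden > d_model` is what lets the autoencoder allocate one latent per ground-truth feature.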
[Interim research report] Taking features out of superposition with sparse autoencoders — LessWrong
2023
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.

2024

Anthropic
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model. (For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper, and the finetuned model rather than the base pretrained model, although the method also works on the base model.)
OpenAI
OpenAI trained k-sparse autoencoders, using a TopK activation to directly control sparsity, improving the reconstruction-sparsity frontier (tradeoff) and finding clean scaling laws.
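The k-sparse idea can be sketched as a TopK activation that keeps only the k largest pre-activations per example and zeroes the rest, so every latent vector has exactly k active features (no L1 coefficient to tune). The function name and demo values below are illustrative, not OpenAI's code:

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Zero all but the k largest entries along the last axis."""
    idx = np.argsort(pre_acts, axis=-1)[..., -k:]   # indices of the top-k entries
    out = np.zeros_like(pre_acts)
    vals = np.take_along_axis(pre_acts, idx, axis=-1)
    np.put_along_axis(out, idx, vals, axis=-1)       # scatter top-k values back
    return out

# Each row keeps exactly k=2 active features.
z = topk_activation(np.array([[3.0, 1.0, 2.0, 5.0]]), k=2)
# → [[3. 0. 0. 5.]]
```

Because sparsity is enforced exactly rather than penalized, sweeping k traces out the reconstruction-sparsity frontier directly.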
SAE viewer
