2022 Dec
It turns out that an extremely simple method – training a single-layer autoencoder to reconstruct neural activations with an L1 penalty on its hidden activations – doesn't just find features that minimize the loss, but actually recovers the ground-truth features that generated the data. However, at least with this sparse-coding approach, extracting features from superposition is extremely costly (possibly more costly than training the models themselves):
- The L1 penalty coefficient needs to be just right
- We need more learned features than ground truth features
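The setup above can be sketched in a few lines. This is a minimal illustration of the method as described (ReLU encoder, linear decoder, MSE reconstruction plus an L1 sparsity penalty), not the authors' actual code; the dimensions, initialization, and `l1_coeff` value are placeholder assumptions, and note the hidden layer is wider than the input, reflecting the need for more learned features than ground-truth features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes: activation dim 64, overcomplete dictionary of 256 features
d_model, d_hidden = 64, 256
W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    # Encoder: ReLU gives non-negative, hopefully sparse, feature activations
    h = np.maximum(0.0, x @ W_enc + b_enc)
    # Decoder: linear reconstruction of the original activations
    x_hat = h @ W_dec + b_dec
    recon = np.mean((x - x_hat) ** 2)          # reconstruction loss
    sparsity = l1_coeff * np.abs(h).sum(-1).mean()  # L1 penalty on hidden acts
    return x_hat, h, recon + sparsity

x = rng.normal(size=(8, d_model))   # a batch of fake "neural activations"
x_hat, h, loss = sae_forward(x)
```

Training would minimize this loss with SGD/Adam; the report's finding is that the `l1_coeff` sweet spot is narrow, which is part of what makes the method costly.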
[Interim research report] Taking features out of superposition with sparse autoencoders — LessWrong
We're thankful for helpful comments from Trenton Bricken, Eric Winsor, Noa Nabeshima, and Sid Black. …
https://www.lesswrong.com/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition
2023
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
https://transformer-circuits.pub/2023/monosemantic-features#setup-autoencoder-motivation
2024
Anthropic
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model. (For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024: the exact finetuned model in production as of the writing of that paper, not the base pretrained model, although the method also works on the base model.)
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#assessing-tour-influence/
OpenAI
OpenAI trained k-sparse (TopK) autoencoders, which control sparsity directly by keeping only the k largest latent activations per example, improving the reconstruction–sparsity frontier (tradeoff) and yielding clean scaling laws.
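The TopK mechanism is easy to sketch: instead of an L1 penalty, sparsity is enforced structurally by zeroing all but the k largest pre-activations. This is an illustrative NumPy sketch of that activation function only (the function name and shapes are mine, not from the paper):

```python
import numpy as np

def topk_activation(pre, k):
    # Keep only the k largest pre-activations in each row; zero out the rest.
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]   # indices of top-k entries
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return out

pre = np.random.default_rng(1).normal(size=(5, 32))  # fake encoder pre-activations
h = topk_activation(pre, k=4)                        # exactly 4 active latents per row
```

Because k is fixed, the sparsity level needs no penalty-coefficient tuning, which is what makes the reconstruction–sparsity tradeoff directly controllable.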
SAE viewer
https://openaipublic.blob.core.windows.net/sparse-autoencoder/sae-viewer/index.html

Seonglae Cho