basic
[Interim research report] Taking features out of superposition with sparse autoencoders — LessWrong
We're thankful for helpful comments from Trenton Bricken, Eric Winsor, Noa Nabeshima, and Sid Black. …
https://www.lesswrong.com/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition
2023
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
https://transformer-circuits.pub/2023/monosemantic-features#setup-autoencoder-motivation
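The core setup in both references can be sketched as a small overcomplete autoencoder: a ReLU encoder produces sparse, non-negative feature activations, a linear decoder reconstructs the model activation from a feature dictionary, and training balances reconstruction error against an L1 sparsity penalty. A minimal numpy sketch of the forward pass and loss (the dimensions, the pre-encoder bias subtraction, and the L1 coefficient here are illustrative assumptions, not either paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the feature dictionary is overcomplete (d_hidden > d_model)
d_model, d_hidden = 8, 32

W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    # Encoder: ReLU yields sparse, non-negative feature activations
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Decoder: reconstruct the activation as a sparse sum of dictionary rows
    x_hat = f @ W_dec + b_dec
    # Loss: reconstruction MSE plus an L1 penalty pushing activations to zero
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()
    return x_hat, f, loss

# A batch of fake "residual stream" activations standing in for model internals
x = rng.normal(size=(4, d_model))
x_hat, f, loss = sae_forward(x)
```

In practice the weights are trained by gradient descent on `loss`; the L1 term is what drives individual dictionary features toward monosemantic, interpretable directions.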

Seonglae Cho