2022 Dec
It turns out that an extremely simple method – training a single-layer autoencoder to reconstruct neural activations with an L1 penalty on the hidden activations – doesn’t just find features that minimize the loss; it actually recovers the ground-truth features that generated the data (a minimal sketch of this setup follows the list below). However, at least with this method of sparse coding, extracting features from superposition is extremely costly (possibly more costly than training the models themselves):
- The L1 penalty coefficient needs to be just right
- We need more learned features than ground truth features
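A minimal sketch of the setup described above, written in PyTorch. The layer sizes, L1 coefficient, and names are illustrative assumptions, not the original authors' code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_activations: int, d_hidden: int):
        super().__init__()
        # Single hidden layer; d_hidden > d_activations so there are
        # more learned features than ground-truth features (see bullets above).
        self.encoder = nn.Linear(d_activations, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_activations)

    def forward(self, x):
        h = torch.relu(self.encoder(x))   # hidden "feature" activations
        x_hat = self.decoder(h)           # reconstruction of the activations
        return x_hat, h

def loss_fn(x, x_hat, h, l1_coeff: float):
    # Reconstruction loss plus an L1 penalty on hidden activations;
    # l1_coeff has to be tuned carefully (first bullet above).
    recon = ((x - x_hat) ** 2).mean()
    sparsity = h.abs().mean()
    return recon + l1_coeff * sparsity

# Illustrative training step on random stand-in "activations".
sae = SparseAutoencoder(d_activations=64, d_hidden=256)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
x = torch.randn(1024, 64)
x_hat, h = sae(x)
loss = loss_fn(x, x_hat, h, l1_coeff=1e-3)
loss.backward()
opt.step()
```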
2023
2024
Anthropic
OpenAI: K-sparse autoencoders to directly control sparsity, improving the reconstruction-sparsity frontier (tradeoff) and finding scaling laws.
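A hedged sketch of a k-sparse autoencoder, assuming a TopK activation as the sparsity mechanism: instead of tuning an L1 coefficient, only the k largest hidden pre-activations are kept per example, so sparsity is controlled directly. Names and the value of k are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class TopKAutoencoder(nn.Module):
    def __init__(self, d_activations: int, d_hidden: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_activations, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_activations)

    def forward(self, x):
        pre = self.encoder(x)
        # Keep only the k largest pre-activations per example; zero the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        h = torch.zeros_like(pre).scatter_(-1, topk.indices, torch.relu(topk.values))
        x_hat = self.decoder(h)
        return x_hat, h

# With exactly k active features per example, training only needs the
# reconstruction loss, and sweeping k traces out the reconstruction-sparsity
# frontier directly.
sae = TopKAutoencoder(d_activations=64, d_hidden=1024, k=32)
x = torch.randn(8, 64)
x_hat, h = sae(x)
recon_loss = ((x - x_hat) ** 2).mean()
```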