Efficient Dictionary Learning with Switch Sparse Autoencoders
Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in...
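The excerpt does not describe the architecture in detail. The sketch below shows only what a standard, non-Switch SAE computes, under common assumptions: a ReLU encoder, a linear decoder, and an L1 sparsity penalty. It is not the Switch SAE method from the paper. The function names, the unit-norm decoder rows, and `l1_coeff` are all hypothetical choices made for illustration.

```python
import numpy as np

def init_sae(d_model: int, d_sae: int, seed: int = 0) -> dict:
    """Hypothetical init: an SAE with d_sae latent features (usually d_sae >> d_model)."""
    rng = np.random.default_rng(seed)
    W_dec = rng.normal(size=(d_sae, d_model))
    # Assumption: each decoder row (one feature direction) is normalised to unit length.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {
        "W_enc": W_dec.T.copy(),   # encoder initialised as the decoder transpose
        "b_enc": np.zeros(d_sae),
        "W_dec": W_dec,
        "b_dec": np.zeros(d_model),
    }

def encode(params: dict, x: np.ndarray) -> np.ndarray:
    # Centre by the decoder bias, project into feature space, then ReLU
    # so that feature activations are non-negative (and often exactly zero).
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    return np.maximum(pre, 0.0)

def decode(params: dict, f: np.ndarray) -> np.ndarray:
    # Reconstruct the activation as a sum of feature directions weighted by f.
    return f @ params["W_dec"] + params["b_dec"]

def sae_loss(params: dict, x: np.ndarray, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    f = encode(params, x)
    x_hat = decode(params, f)
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(f), axis=-1))
    return mse + l1_coeff * l1, f, x_hat

# Usage on random stand-in "activations" (8 vectors of width 16, 64 features).
params = init_sae(d_model=16, d_sae=64)
x = np.random.default_rng(1).normal(size=(8, 16))
loss, f, x_hat = sae_loss(params, x)
print(f.shape, x_hat.shape)  # (8, 64) (8, 16)
```

Every input is encoded against the full set of `d_sae` features. This is the dense encoder cost that grows with dictionary size.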
https://arxiv.org/abs/2410.08201