Sparse Autoencoder

Creator: Seonglae Cho
Created: 2024 Apr 7 15:21
Edited: 2026 Feb 19 0:48
Refs

Linear Readout of features on superposition

An SAE Structure is very similar in architecture to the MLP layers in language models, and so should be similarly powerful in its ability to recover features from superposition.
A neuron refers to an activation within a model, while a feature refers to an activation that has been separated out by a sparse autoencoder.
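A minimal sketch of such an SAE, assuming a single overcomplete hidden layer with a ReLU encoder, a linear decoder, and an L1 sparsity penalty; the class name, initialization, and coefficients are illustrative rather than taken from any particular implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE over MLP activations: an overcomplete dictionary
    (d_features >> d_model) with a ReLU encoder and linear decoder."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_features) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_features))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # feature activations; ReLU keeps them non-negative and sparse
        return F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # reconstruct the activation as a sparse sum of decoder directions
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor, l1_coeff: float = 1e-3):
        f = self.encode(x)
        x_hat = self.decode(f)
        recon_loss = (x - x_hat).pow(2).mean()              # reconstruction term
        sparsity_loss = l1_coeff * f.abs().sum(-1).mean()   # L1 sparsity term
        return x_hat, f, recon_loss + sparsity_loss
```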
Reconstructed Transformer NLL: Anthropic would like the discovered features to explain almost all of the behavior of the underlying transformer. One way to measure this is to take the transformer, run the MLP activations through the autoencoder, replace the MLP activations with the autoencoder's reconstructions, measure the loss on the training dataset, and calculate the increase in loss relative to the original model.
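A hedged sketch of this measurement, assuming a TransformerLens-style hooked model and the SparseAutoencoder sketched above; the hook name and function name are placeholders for illustration:

```python
import torch

@torch.no_grad()
def reconstruction_loss_penalty(model, sae, tokens, mlp_hook_name):
    """Compare the model's loss with original MLP activations, with the
    SAE's reconstructions substituted in, and with the MLP zero-ablated.
    `model` is assumed to be a TransformerLens-style HookedTransformer;
    `mlp_hook_name` (e.g. "blocks.6.mlp.hook_post") is a placeholder."""
    # 1. baseline loss with the untouched model
    base_loss = model(tokens, return_type="loss")

    # 2. loss with MLP activations replaced by the SAE reconstruction
    def replace_with_reconstruction(act, hook):
        return sae.decode(sae.encode(act))

    recon_loss = model.run_with_hooks(
        tokens, return_type="loss",
        fwd_hooks=[(mlp_hook_name, replace_with_reconstruction)],
    )

    # 3. loss with the MLP activations zero-ablated, as a reference point
    zero_loss = model.run_with_hooks(
        tokens, return_type="loss",
        fwd_hooks=[(mlp_hook_name, lambda act, hook: torch.zeros_like(act))],
    )

    # the "21%" figure quoted below corresponds to this ratio
    penalty_fraction = (recon_loss - base_loss) / (zero_loss - base_loss)
    return base_loss, recon_loss, zero_loss, penalty_fraction
```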

Ablation study

Anthropic performs feature ablations by running the model on an entire context up through the MLP layer, running the autoencoder to compute feature activations, subtracting the feature direction times its activation from the MLP activation on each token in the context (replacing the activation x with x − f_i(x)·d_i, where f_i(x) is feature i's activation and d_i its decoder direction), and then completing the forward pass. The resulting change in the predicted log-likelihood of each token in the context is recorded as the color of an underline on that token. Thus if a feature were active on token [B] in the sequence [A][B][C], and ablating that feature reduced the odds placed on the prediction of [C], then there would be an orange background on [B] (the activation) and a blue underline on [C] (the ablation effect), indicating that ablating the feature increased the model's loss on the prediction of [C], and hence that the feature is responsible for improving the model's ability to predict [C] in that context.
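A sketch of this ablation under the same assumptions as the snippet above (TransformerLens-style hooks and the SAE sketched earlier); `feature_idx` and the hook name are placeholders:

```python
import torch

@torch.no_grad()
def ablate_feature(model, sae, tokens, mlp_hook_name, feature_idx):
    """Subtract one feature's contribution f_i(x) * d_i from the MLP activation
    on every token, then compare per-token log-likelihoods with the baseline."""
    def subtract_feature(act, hook):
        f = sae.encode(act)                          # [batch, pos, d_features]
        d_i = sae.W_dec[feature_idx]                 # [d_model]
        return act - f[..., feature_idx].unsqueeze(-1) * d_i

    base_logits = model(tokens)
    ablated_logits = model.run_with_hooks(
        tokens, fwd_hooks=[(mlp_hook_name, subtract_feature)]
    )

    # log-likelihood of each next token before and after the ablation
    def next_token_logprob(logits):
        logp = logits.log_softmax(dim=-1)
        return logp[:, :-1].gather(-1, tokens[:, 1:, None]).squeeze(-1)

    delta = next_token_logprob(ablated_logits) - next_token_logprob(base_logits)
    return delta  # negative entries: ablating the feature hurt that prediction
```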
The additional loss incurred by replacing the MLP activations with the autoencoder's output is just 21% of the loss that would be incurred by zero-ablating the MLP. This loss penalty can be reduced by using more features or a lower L1 coefficient.
One issue is that the features are not believed to be completely monosemantic (some polysemanticity may be hiding in low activations), nor are all of them necessarily cleanly interpretable.

Steering Vector
Usages

For instance, researchers discovered a feature associated with the model unquestioningly agreeing with the user. Deliberately activating this feature completely changes the model's responses and behavior. This opens the path toward mapping all of an LLM's features and controlling them for safety, such as by suppressing certain features and artificially activating others (see the steering sketch after the list below).
  • Could we show the features associated with every response?
  • If an unneeded or undesired feature is triggered by our prompt, we can adjust the prompt to deliberately avoid it.
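A minimal steering sketch under the same assumptions as the earlier snippets; the hook name, feature index, and coefficient are purely illustrative:

```python
import torch

@torch.no_grad()
def steer_with_feature(model, sae, tokens, hook_name, feature_idx, coeff=10.0):
    """Add `coeff` times feature i's decoder direction to the activations at
    `hook_name` on every token, then complete the forward pass. The hook name,
    feature index, and coefficient are placeholders, not values from a paper."""
    direction = sae.W_dec[feature_idx]  # the feature's write direction, [d_model]

    def add_direction(act, hook):
        return act + coeff * direction

    return model.run_with_hooks(tokens, fwd_hooks=[(hook_name, add_direction)])
```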
Sparse AutoEncoders
Neuron SAE Notion
https://openai.com/index/extracting-concepts-from-gpt-4/
https://www.lesswrong.com/posts/ATsvzF77ZsfWzyTak/dataset-sensitivity-in-feature-matching-and-a-hypothesis-on-1#4_1_2__Feature_Activation_with_Quantiles
https://www.tilderesearch.com/blog/rate-distortion-saes
https://www.lesswrong.com/posts/feknAa3hQgLG2ZAna/cross-layer-feature-alignment-and-steering-in-large-language-2
  • MLP SAE
  • Attention SAE

Short explanation

An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability
Sparse Autoencoders (SAEs) have recently become popular for interpretability of machine learning models (although sparse dictionary learning has been around since 1997). Machine learning models and LLMs are becoming more powerful and useful, but they are still black boxes, and we don’t understand how they do the things that they are capable of. It seems like it would be useful if we could understand how they work.

Engineering challenges

The engineering challenges of scaling interpretability
Open problems for SAE
Sparsify: A mechanistic interpretability research agenda — AI Alignment Forum

Recommendations