Linear Readout of Features in Superposition
A Sparse Autoencoder is architecturally very similar to the MLP layers in language models, so it should be similarly capable of recovering features from superposition.
A neuron refers to an activation within the model itself, while a feature refers to an activation that has been separated out by a sparse autoencoder.
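A minimal sketch of this architecture (PyTorch here purely as an illustration; the exact dimensions, bias handling, and loss weighting are my assumptions, not any specific paper's):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: like an MLP block, it projects up into a wider hidden
    layer and back down, but it is trained to reconstruct its input under an
    L1 sparsity penalty on the hidden (feature) activations."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # analogous to the MLP up-projection
        self.decoder = nn.Linear(d_features, d_model)   # analogous to the MLP down-projection

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse "feature" activations
        recon = self.decoder(features)          # reconstruction of the original activation
        return recon, features

def sae_loss(x, recon, features, l1_coefficient=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return ((recon - x) ** 2).sum(-1).mean() + l1_coefficient * features.abs().sum(-1).mean()
```

Here the raw MLP activations fed in as `x` correspond to neurons, while the sparse hidden activations are the features in the sense above.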
Reconstructed Transformer NLL: Anthropic would like the features discovered by the autoencoder to explain almost all of the behavior of the underlying transformer. One way to measure this is to take a transformer, run its MLP activations through the autoencoder, replace the MLP activations with the autoencoder's reconstructions, measure the loss on the training dataset, and compute the difference from the original loss.
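A hedged sketch of that measurement, assuming a generic PyTorch model whose per-layer MLP module can be reached at a hypothetical path like `model.blocks[layer].mlp` and whose forward pass returns a `.loss`:

```python
import torch

@torch.no_grad()
def reconstruction_loss_delta(model, sae, tokens, layer):
    """Loss with the original MLP output vs. loss with the SAE reconstruction
    substituted in its place (module paths and the .loss attribute are assumptions)."""
    original_loss = model(tokens).loss

    def substitute(module, inputs, output):
        recon, _features = sae(output)
        return recon  # returning a value from a forward hook replaces the module output

    handle = model.blocks[layer].mlp.register_forward_hook(substitute)
    try:
        substituted_loss = model(tokens).loss
    finally:
        handle.remove()

    return substituted_loss - original_loss
```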
Ablation study
Anthropic performs feature ablations by running the model on an entire context up through the MLP layer, running the autoencoder to compute feature activations, subtracting the feature direction times its activation from the MLP activation on each token in the context (replacing $x$ with $x - f_i(x)\,d_i$, where $d_i$ is the feature direction and $f_i(x)$ its activation), and then completing the forward pass. The resulting change in the predicted log-likelihood of each token in the context is recorded as the color of an underline on that token. Thus if a feature were active on token [B] in the sequence [A][B][C], and ablating that feature reduced the odds placed on the prediction of [C], then there would be an orange background on [B] (the activation) and a blue underline on [C] (the ablation effect), indicating that ablating the feature increased the model's loss on the prediction of [C] and hence that the feature is responsible for improving the model's ability to predict [C] in that context.
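A sketch of that ablation under the same assumed hook setup as above; `feature_idx` picks the feature whose direction $d_i$ and activation $f_i(x)$ get subtracted:

```python
import torch

@torch.no_grad()
def feature_ablation_effect(model, sae, tokens, layer, feature_idx):
    """Subtract one feature's contribution f_i(x) * d_i from the MLP output on
    every token, then report the change in log-likelihood of each actual token."""
    baseline = model(tokens).logits.log_softmax(-1)

    def subtract_feature(module, inputs, output):
        _recon, features = sae(output)
        direction = sae.decoder.weight[:, feature_idx]                 # d_i, shape (d_model,)
        return output - features[..., feature_idx, None] * direction  # x - f_i(x) * d_i

    handle = model.blocks[layer].mlp.register_forward_hook(subtract_feature)
    try:
        ablated = model(tokens).logits.log_softmax(-1)
    finally:
        handle.remove()

    targets = tokens[:, 1:]                # next-token targets
    delta = (baseline - ablated)[:, :-1]   # (batch, seq-1, vocab)
    # Positive values mean ablating the feature hurt the prediction of that token.
    return delta.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
```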
The additional loss incurred by replacing the MLP activations with the autoencoder's output is just 21% of the loss that would be incurred by zero-ablating the MLP. This loss penalty can be reduced by using more features or a lower L1 coefficient.
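One way to write that fraction explicitly (my notation, not Anthropic's): with $L_\text{orig}$ the original loss, $L_\text{SAE}$ the loss after substituting the autoencoder's reconstruction, and $L_\text{zero}$ the loss after zero-ablating the MLP,

$$\text{loss penalty} = \frac{L_\text{SAE} - L_\text{orig}}{L_\text{zero} - L_\text{orig}} \approx 0.21$$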
One issue is that we don't believe our features are completely monosemantic (some polysemanticity may be hiding in low activations), nor are all of them necessarily cleanly interpretable.
Steering Vector Use Cases
For instance, researchers found a feature associated with the model unquestioningly agreeing with the user. Deliberately activating this feature substantially changes the model's responses and behavior. This opens the door to mapping an LLM's features comprehensively and controlling them for safety, for example by suppressing certain features and artificially triggering others (a minimal steering sketch follows the bullets below).
- Show the features associated with every response?
- If an unneeded or undesired feature is triggered by our prompt, we can adjust the prompt to deliberately avoid activating it.
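A minimal sketch of that kind of feature steering, under the same assumed SAE and hook setup as above; the feature index and the scale `alpha` are hand-picked assumptions:

```python
import torch

def steer_with_feature(model, sae, tokens, layer, feature_idx, alpha=5.0):
    """Add alpha times a chosen feature's decoder direction to the MLP output,
    artificially triggering that feature for the whole forward pass."""
    direction = sae.decoder.weight[:, feature_idx].detach()  # d_i, shape (d_model,)

    def add_feature(module, inputs, output):
        return output + alpha * direction  # broadcasts over batch and sequence positions

    handle = model.blocks[layer].mlp.register_forward_hook(add_feature)
    try:
        with torch.no_grad():
            return model(tokens)  # or a generate() call for steered sampling
    finally:
        handle.remove()
```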
Sparse Autoencoders
Neuron SAE Notion

- Residual SAE for Residual Stream
- MLP SAE
- Attention SAE
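These variants differ mainly in which activation they are trained to reconstruct. A hedged sketch of collecting the three sites with generic PyTorch forward hooks (the module paths are assumptions, and some implementations return tuples that would need unpacking):

```python
import torch

def collect_activations(model, tokens, layer):
    """Cache the residual-stream, MLP-output, and attention-output activations
    at one layer; each would be the training input for the matching SAE."""
    sites = {}

    def grab(name):
        def hook(module, inputs, output):
            sites[name] = output.detach()  # assumes each module returns a plain tensor
        return hook

    block = model.blocks[layer]  # hypothetical module path
    handles = [
        block.register_forward_hook(grab("residual")),        # block output ~ residual stream
        block.mlp.register_forward_hook(grab("mlp")),          # MLP output
        block.attn.register_forward_hook(grab("attention")),   # attention output
    ]
    try:
        with torch.no_grad():
            model(tokens)
    finally:
        for h in handles:
            h.remove()
    return sites
```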
Short explanation
An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability
Sparse Autoencoders (SAEs) have recently become popular for interpretability of machine learning models (although sparse dictionary learning has been around since 1997). Machine learning models and LLMs are becoming more powerful and useful, but they are still black boxes, and we don’t understand how they do the things that they are capable of. It seems like it would be useful if we could understand how they work.
https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html
Finite State Automata
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
https://transformer-circuits.pub/2023/monosemantic-features#phenomenology-fsa
Engineering challenges
The engineering challenges of scaling interpretability
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
https://www.anthropic.com/research/engineering-challenges-interpretability

The overcomplete basis of an SAE means the same activation can be decomposed, and hence interpreted, in multiple ways (problem statement)
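A toy illustration of that non-uniqueness: with an overcomplete dictionary, the same activation admits more than one exact sparse decomposition (the numbers below are made up for illustration).

```python
import numpy as np

# Overcomplete dictionary: 3 feature directions in a 2-dimensional activation space.
D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])  # rows are feature directions

x = np.array([1.0, 1.0])  # activation to explain

# Two different feature decompositions that reconstruct x exactly:
code_a = np.array([1.0, 1.0, 0.0])  # use the first two features
code_b = np.array([0.0, 0.0, 1.0])  # use only the third feature

assert np.allclose(code_a @ D, x)
assert np.allclose(code_b @ D, x)
# Sparsity (e.g. the L1 penalty) is what breaks the tie between such decompositions,
# but different SAEs or hyperparameters can still prefer different solutions.
```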
Open problems for SAEs
Sparsify: A mechanistic interpretability research agenda — AI Alignment Forum
Over the last couple of years, mechanistic interpretability has seen substantial progress. Part of this progress has been enabled by the identificati…
https://www.alignmentforum.org/posts/64MizJXzyvrYpeKqm/sparsify-a-mechanistic-interpretability-research-agenda#Objective_1__Improving_SAEs

Seonglae Cho