Monosemanticity

Creator
Seonglae Cho
Created
2023 Nov 29 11:59
Edited
2025 Jun 21 20:34

Mono-semanticity

Concretely, Anthropic takes a one-layer transformer with a 512-neuron MLP layer and decomposes its MLP activations into relatively interpretable features by training sparse autoencoders on MLP activations from 8 billion data points, with expansion factors ranging from 1× (512 features) to 256× (131,072 features). The detailed interpretability analyses focus on the 4,096 features learned in one run, called A/1.
Among them, in the successful run A/1, 3,928 of the 4,096 learned vectors were successfully classified, and the remaining 168 were determined to be dead neurons. We should expect the feature directions to form an overcomplete basis; that is, the decomposition should have more directions d_i than neurons. Moreover, the feature activations should be sparse, because sparsity is what enables this kind of noisy simulation. This is mathematically identical to the classic problem of dictionary learning.
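A minimal sketch of this decomposition, assuming a ReLU encoder and an L1 sparsity penalty — the dimensions (512-neuron MLP, 8× expansion giving 4,096 features) come from run A/1, but the weights here are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the text: 512-neuron MLP, 8x expansion -> 4,096 features (run A/1).
d_mlp, d_feat = 512, 4096

# Randomly initialized encoder/decoder weights (illustrative only; the real SAE
# is trained by gradient descent over billions of MLP activations).
W_enc = rng.normal(0, 0.02, (d_mlp, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0, 0.02, (d_feat, d_mlp))
b_dec = np.zeros(d_mlp)

def sae_forward(x):
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU -> sparse, non-negative features
    x_hat = f @ W_dec + b_dec                # reconstruction from the dictionary
    return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # reconstruction error plus an L1 penalty that encourages sparse activations
    return np.mean((x - x_hat) ** 2) + l1_coeff * np.mean(np.abs(f).sum(axis=-1))

x = rng.normal(size=(32, d_mlp))             # a batch of MLP activation vectors
x_hat, f = sae_forward(x)
loss = sae_loss(x, x_hat, f)
```

The expansion factor is what makes the basis overcomplete: the encoder maps 512 dimensions into 4,096 candidate directions, and the L1 term forces only a few to be active per input.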
Specific configurations for Transformer and Sparse Autoencoder used for experiment
Using Sparse Autoencoder to decompose neuron activations in transformer models

The Pile

Using sparse autoencoders, researchers analyzed each feature's role by examining when its activation peaked and labeling its function. Rather than intentionally pursuing monosemanticity, they discovered the features were naturally monosemantic after training. Sparse autoencoders produce interpretable features that are effectively invisible in the neuron basis.
Neuron Activation for each topic
Sparse autoencoder features can be used to intervene on and steer transformer generation. This represents a new approach to AI Guardrail and AI Alignment. Since existing approaches based on decoding strategies, prompt engineering, or RL training inevitably have flaws, future services might use this feature-level control method for more stable operation. This risk-management aspect is particularly crucial for large companies, which is why top companies like OpenAI and Anthropic are focusing on this area.
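A sketch of what such feature-level steering looks like, assuming access to one feature's decoder direction from a trained SAE — here the direction is random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d_mlp = 512

# Hypothetical decoder row for one interpretable feature (e.g. a "base64"
# feature); in practice this would come from the trained SAE's decoder weights.
feature_dir = rng.normal(size=d_mlp)
feature_dir /= np.linalg.norm(feature_dir)

def steer(mlp_acts, direction, scale=5.0):
    # Add the feature's decoder direction to the MLP activations at every
    # position, pushing subsequent generation toward that feature's concept.
    return mlp_acts + scale * direction

acts = rng.normal(size=(10, d_mlp))      # activations at 10 token positions
steered = steer(acts, feature_dir)
```

Because the intervention is a single additive direction rather than a prompt or a decoding rule, its strength can be dialed up or down continuously via `scale`.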
The distance between feature vectors reflects the similarity of the concepts they represent. In other words, the model carries a map of concepts in which related concepts have nearby feature directions.

SAE Feature Splitting

  • For example, one base64 feature in a small dictionary splits into three more subtle, yet still interpretable, features in a larger dictionary. Despite the MLP layer being very small, new features keep appearing as the sparse autoencoder is scaled up.

SAE Feature Universality

  • Sparse autoencoders produce relatively universal features: applied to different transformer language models, they yield mostly similar features, more similar to one another than to either model's own neurons.
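One simple way to quantify such universality, as a sketch: correlate the activations of a candidate feature pair from the two models' SAEs over shared data. The activations below are synthetic stand-ins for a genuinely "universal" pair:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic activations of one feature from SAEs trained on two different
# models, evaluated over the same 1,000 tokens of text. The second is built
# to track the first, mimicking a universal feature pair.
acts_model_a = rng.normal(size=1000)
acts_model_b = 0.9 * acts_model_a + 0.1 * rng.normal(size=1000)

def activation_correlation(a, b):
    # Pearson correlation of the two features' activations over shared data;
    # a high value suggests both features track the same underlying concept.
    return np.corrcoef(a, b)[0, 1]

r = activation_correlation(acts_model_a, acts_model_b)
```

Running this pairing over all features of both dictionaries would give the kind of cross-model matching the bullet describes.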

Finite State Automata (AI Circuit)

One of the most interesting findings is that some features work together to generate valid HTML.
Positional Embedding also takes a similar strategy.
Naively, simulating a layer with 100k features would be 100,000 times more expensive than sampling a large language model such as Claude 2 or GPT-4 (we'd need to sample the explaining large model once for every feature at every token). At current prices, this would suggest simulating 100k features on a single 4096 token context would cost $12,500–$25,000, and one would presumably need to evaluate over many contexts.
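The arithmetic behind this estimate can be reproduced with assumed per-1k-token prices (the $0.03–$0.06 figures below roughly match 2023 frontier-model API rates and are an assumption, not from the source):

```python
# Back-of-envelope reproduction of the cost estimate above: sampling the
# explaining model once per feature over a full 4,096-token context.
n_features = 100_000
context_tokens = 4_096
total_tokens = n_features * context_tokens      # one pass per feature

price_low, price_high = 0.03, 0.06              # assumed $ per 1k tokens
cost_low = total_tokens / 1_000 * price_low
cost_high = total_tokens / 1_000 * price_high
print(f"${cost_low:,.0f} - ${cost_high:,.0f}")
```

With these assumed prices the range comes out near the $12,500–$25,000 quoted above, before even multiplying by the many contexts one would need to evaluate.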
 
 
 

SAE Latent Neuron Examples

 
 

Trenton Bricken

Short version

Dictionary

2024
Claude 3 Sonnet

Including decoder weight into the Loss
Korean alignment (EMNLP main 2024)
 
 

Recommendations