Monosemanticity
Concretely, Anthropic takes a one-layer transformer with a 512-neuron MLP layer and decomposes the MLP activations into relatively interpretable features by training sparse autoencoders on MLP activations from 8 billion data points, with expansion factors ranging from 1× (512 features) to 256× (131,072 features). The detailed interpretability analyses focus on the 4,096 features learned in one run, called A/1.

In the A/1 run, 3,928 of the 4,096 learned feature vectors were successfully classified as interpretable, while the remaining 168 turned out to be dead features that never activate.
We should expect the feature directions to form an overcomplete basis; that is, the decomposition should have more directions than neurons. Moreover, the feature activations should be sparse, because sparsity is what enables this kind of noisy simulation. This is mathematically identical to the classic problem of dictionary learning.
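A minimal sketch of this dictionary-learning setup in PyTorch. The layer sizes, ReLU encoder, and L1 coefficient are illustrative simplifications, not Anthropic's exact architecture:

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Decompose d_mlp-dimensional MLP activations into an overcomplete, sparse feature basis."""

    def __init__(self, d_mlp: int = 512, expansion: int = 8):
        super().__init__()
        d_feat = d_mlp * expansion             # more directions than neurons (overcomplete)
        self.encoder = nn.Linear(d_mlp, d_feat)
        self.decoder = nn.Linear(d_feat, d_mlp)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))        # non-negative, ideally sparse feature activations
        x_hat = self.decoder(f)                # reconstruction as a weighted sum of dictionary directions
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # reconstruction error plus an L1 penalty that pushes feature activations toward sparsity
    return ((x - x_hat) ** 2).sum(dim=-1).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
```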

Using sparse autoencoders to decompose neuron activations in transformer models
The Pile
Using sparse autoencoders, researchers analyzed each learned feature's role by examining the inputs on which its activation was highest and labeling its function. Monosemanticity was not imposed as an explicit training objective; rather, the learned features turned out to be largely monosemantic. Sparse autoencoders produce interpretable features that are effectively invisible in the neuron basis.
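A hedged sketch of that labeling step, reusing the `SparseAutoencoder` sketch above: collect the dataset tokens on which a given feature activates most strongly, then inspect them by hand (or with an LLM) to name the feature. The tensor shapes and arguments are assumptions for illustration.

```python
import torch


@torch.no_grad()
def top_tokens_for_feature(sae, mlp_acts, token_strs, feature_id, k=20):
    """Return the k dataset tokens on which one SAE feature activates most strongly.

    mlp_acts:   (n_tokens, d_mlp) MLP activations collected over a text dataset
    token_strs: list of n_tokens token strings aligned with mlp_acts
    """
    feats = torch.relu(sae.encoder(mlp_acts))     # (n_tokens, d_feat) feature activations
    vals, idx = feats[:, feature_id].topk(k)      # strongest activations of this feature
    return [(vals[i].item(), token_strs[idx[i].item()]) for i in range(k)]
```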


Sparse autoencoder features can be used to intervene on and steer transformer generation. This represents a new approach to AI Guardrail and AI Alignment. Because existing approaches based on decoding strategies, prompt engineering, or RL training each have inherent weaknesses, future services might use this feature-level activation control for more stable operation. This risk-management aspect is particularly important for large companies, which is why leading labs such as OpenAI and Anthropic are focusing on this area.
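A hedged sketch of what such feature-level steering can look like: add a scaled copy of a feature's decoder direction to the MLP output during generation via a forward hook. The hook point (`model.mlp`) and the scale are illustrative assumptions, not Anthropic's exact intervention.

```python
import torch


def steer_with_feature(model, sae, feature_id, scale=5.0):
    """Bias generation by adding a scaled SAE decoder direction to the MLP output."""
    direction = sae.decoder.weight[:, feature_id]          # (d_mlp,) this feature's decoder column

    def hook(module, inputs, output):
        return output + scale * direction.to(output.dtype)

    # 'model.mlp' is a placeholder for wherever the hooked MLP module lives in your model
    handle = model.mlp.register_forward_hook(hook)
    return handle  # call handle.remove() to stop steering
```

Amplifying a feature's activation pushes generations toward the concept it represents, while suppressing it steers them away.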
The distances between feature vectors reflect the similarity of the concepts they represent. In other words, the model organizes concepts in a geometry where related concepts correspond to nearby feature directions.
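One way to see that geometry, again under the assumed sketch above, is to look up a feature's nearest neighbors by cosine similarity of decoder directions:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def nearest_features(sae, feature_id, k=5):
    """Find the k features whose decoder directions are closest to a given feature."""
    dirs = F.normalize(sae.decoder.weight.T, dim=-1)   # (d_feat, d_mlp) unit decoder directions
    sims = dirs @ dirs[feature_id]                     # cosine similarity to every feature
    sims[feature_id] = -1.0                            # exclude the query feature itself
    return sims.topk(k)                                # (similarities, feature indices)
```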
SAE Feature Splitting
- For example, a single base64 feature in a small dictionary splits into three more specific, yet still interpretable, features in a larger dictionary. Despite the MLP layer being very small, new features keep appearing as the sparse autoencoder is scaled up.
SAE Feature Universality
- Sparse autoencoders produce relatively universal features: applied to different transformer language models, they yield mostly similar features, more similar to one another than to either model's own neurons.
Finite State Automata (AI Circuit)
One of the most interesting findings is that features work together in finite-state-automaton-like systems, for example to generate valid HTML.

Naively, simulating a layer with 100k features would be 100,000 times more expensive than sampling a large language model such as Claude 2 or GPT-4 (we'd need to sample the explaining large model once for every feature at every token). At current prices, this would suggest simulating 100k features on a single 4096 token context would cost $12,500–$25,000, and one would presumably need to evaluate over many contexts.
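A rough back-of-the-envelope reconstruction of that estimate; the per-token prices below are assumptions (roughly GPT-4 API pricing in 2023), not figures from the paper:

```python
n_features = 100_000
context_tokens = 4096
assumed_prices = (0.03, 0.06)   # assumed $/1K tokens, roughly GPT-4 input/output pricing in 2023

for price in assumed_prices:
    cost = n_features * context_tokens / 1000 * price   # one explainer-model call per feature
    print(f"~${cost:,.0f} for one 4096-token context at ${price}/1K tokens")
# prints ~$12,288 and ~$24,576, consistent with the quoted $12,500–$25,000 range
```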
SAE Latent Neuron Examples



Trenton Bricken
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Short version
Decomposing Language Models Into Understandable Components
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
https://www.anthropic.com/index/decomposing-language-models-into-understandable-components

Dictionary
God Help Us, Let's Try To Understand The Paper On AI Monosemanticity
Inside every AI is a bigger AI, trying to get out
https://www.astralcodexten.com/p/god-help-us-lets-try-to-understand

2024 Claude 3 Sonnet
Including the decoder weight norms in the sparsity penalty of the loss
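A hedged sketch of that loss, where the L1 penalty on each feature activation is weighted by the L2 norm of that feature's decoder column (the coefficient value is illustrative):

```python
import torch


def scaled_sae_loss(x, x_hat, f, decoder_weight, l1_coeff=5.0):
    """Reconstruction error plus a sparsity penalty weighted by decoder column norms."""
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    decoder_norms = decoder_weight.norm(dim=0)          # (d_feat,) L2 norm of each decoder column
    sparsity = (f * decoder_norms).sum(dim=-1).mean()   # f is non-negative (post-ReLU)
    return recon + l1_coeff * sparsity
```

Weighting the penalty by the decoder norm makes it invariant to rescaling a feature's decoder column while shrinking its activation, which is why the decoder columns no longer need to be constrained to unit norm.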
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model. (For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of the paper, and the finetuned model rather than the base pretrained model, although the method also works on the base model.)
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#assessing-tour-influence
Korean
alignment (emnlp main 2024)

Seonglae Cho