Monosemanticity
Concretely, Anthropic takes a one-layer transformer with a 512-neuron MLP layer and decomposes the MLP activations into relatively interpretable features by training sparse autoencoders on MLP activations from 8 billion data points, with expansion factors ranging from 1× (512 features) to 256× (131,072 features). The detailed interpretability analyses focus on the 4,096 features learned in one run, called A/1.

In the A/1 run, 3,928 of the 4,096 learned feature vectors were successfully classified as interpretable, while the remaining 168 turned out to be dead features that never activate.
We should expect the feature directions to form an overcomplete basis; that is, the decomposition should have more directions than neurons. Moreover, the feature activations should be sparse, because sparsity is what enables this kind of noisy simulation. This is mathematically identical to the classic problem of dictionary learning.
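A minimal sketch of this dictionary-learning setup in PyTorch. The layer sizes, ReLU encoder, and L1 coefficient are illustrative simplifications, not Anthropic's exact architecture:

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Decompose d_mlp-dimensional MLP activations into an overcomplete, sparse feature basis."""

    def __init__(self, d_mlp: int = 512, expansion: int = 8):
        super().__init__()
        d_feat = d_mlp * expansion             # more directions than neurons (overcomplete)
        self.encoder = nn.Linear(d_mlp, d_feat)
        self.decoder = nn.Linear(d_feat, d_mlp)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))        # non-negative, ideally sparse feature activations
        x_hat = self.decoder(f)                # reconstruction as a weighted sum of dictionary directions
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # reconstruction error plus an L1 penalty that pushes feature activations toward sparsity
    return ((x - x_hat) ** 2).sum(dim=-1).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
```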

Using sparse autoencoders to decompose neuron activations in transformer models
The Pile
Using sparse autoencoders, researchers analyzed each learned feature's role by examining the inputs on which its activation was highest and labeling its function. Monosemanticity was not imposed as an explicit training objective; rather, the learned features turned out to be largely monosemantic. Sparse autoencoders produce interpretable features that are effectively invisible in the neuron basis.
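A hedged sketch of that labeling step, reusing the `SparseAutoencoder` sketch above: collect the dataset tokens on which a given feature activates most strongly, then inspect them by hand (or with an LLM) to name the feature. The tensor shapes and arguments are assumptions for illustration.

```python
import torch


@torch.no_grad()
def top_tokens_for_feature(sae, mlp_acts, token_strs, feature_id, k=20):
    """Return the k dataset tokens on which one SAE feature activates most strongly.

    mlp_acts:   (n_tokens, d_mlp) MLP activations collected over a text dataset
    token_strs: list of n_tokens token strings aligned with mlp_acts
    """
    feats = torch.relu(sae.encoder(mlp_acts))     # (n_tokens, d_feat) feature activations
    vals, idx = feats[:, feature_id].topk(k)      # strongest activations of this feature
    return [(vals[i].item(), token_strs[idx[i].item()]) for i in range(k)]
```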


Sparse autoencoder features can be used to intervene on and steer transformer generation. This represents a new approach to AI Guardrail and AI Alignment. Because existing approaches based on decoding strategies, prompt engineering, or RL training each have inherent weaknesses, future services might use this feature-level activation control for more stable operation. This risk-management aspect is particularly important for large companies, which is why leading labs such as OpenAI and Anthropic are focusing on this area.
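A hedged sketch of what such feature-level steering can look like: add a scaled copy of a feature's decoder direction to the MLP output during generation via a forward hook. The hook point (`model.mlp`) and the scale are illustrative assumptions, not Anthropic's exact intervention.

```python
import torch


def steer_with_feature(model, sae, feature_id, scale=5.0):
    """Bias generation by adding a scaled SAE decoder direction to the MLP output."""
    direction = sae.decoder.weight[:, feature_id]          # (d_mlp,) this feature's decoder column

    def hook(module, inputs, output):
        return output + scale * direction.to(output.dtype)

    # 'model.mlp' is a placeholder for wherever the hooked MLP module lives in your model
    handle = model.mlp.register_forward_hook(hook)
    return handle  # call handle.remove() to stop steering
```

Amplifying a feature's activation pushes generations toward the concept it represents, while suppressing it steers them away.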
The distances between feature vectors reflect the similarity of the concepts they represent. In other words, the model organizes concepts in a geometry where related concepts correspond to nearby feature directions.
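One way to see that geometry, again under the assumed sketch above, is to look up a feature's nearest neighbors by cosine similarity of decoder directions:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def nearest_features(sae, feature_id, k=5):
    """Find the k features whose decoder directions are closest to a given feature."""
    dirs = F.normalize(sae.decoder.weight.T, dim=-1)   # (d_feat, d_mlp) unit decoder directions
    sims = dirs @ dirs[feature_id]                     # cosine similarity to every feature
    sims[feature_id] = -1.0                            # exclude the query feature itself
    return sims.topk(k)                                # (similarities, feature indices)
```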
SAE Feature Splitting
- For example, a single base64 feature in a small dictionary splits into three more specific, yet still interpretable, features in a larger dictionary. Despite the MLP layer being very small, new features keep appearing as the sparse autoencoder is scaled up.
SAE Feature Universality
- Sparse autoencoders produce relatively universal features: applied to different transformer language models, they yield mostly similar features, more similar to one another than to either model's own neurons.
Finite State Automata (AI Circuit)
One of the most interesting findings is that features work together in finite-state-automaton-like systems, for example to generate valid HTML.

Naively, simulating a layer with 100k features would be 100,000 times more expensive than sampling a large language model such as Claude 2 or GPT-4 (we'd need to sample the explaining large model once for every feature at every token). At current prices, this would suggest simulating 100k features on a single 4096 token context would cost $12,500–$25,000, and one would presumably need to evaluate over many contexts.
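A rough back-of-the-envelope reconstruction of that estimate; the per-token prices below are assumptions (roughly GPT-4 API pricing in 2023), not figures from the paper:

```python
n_features = 100_000
context_tokens = 4096
assumed_prices = (0.03, 0.06)   # assumed $/1K tokens, roughly GPT-4 input/output pricing in 2023

for price in assumed_prices:
    cost = n_features * context_tokens / 1000 * price   # one explainer-model call per feature
    print(f"~${cost:,.0f} for one 4096-token context at ${price}/1K tokens")
# prints ~$12,288 and ~$24,576, consistent with the quoted $12,500–$25,000 range
```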
SAE Latent Neuron Examples



Trenton Bricken
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Short version
Decomposing Language Models Into Understandable Components
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
https://www.anthropic.com/index/decomposing-language-models-into-understandable-components

Dictionary
God Help Us, Let's Try To Understand The Paper On AI Monosemanticity
Inside every AI is a bigger AI, trying to get out
https://www.astralcodexten.com/p/god-help-us-lets-try-to-understand

2024 Claude 3 Sonnet
Including the decoder weight norms in the sparsity penalty of the loss
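A hedged sketch of that loss, where the L1 penalty on each feature activation is weighted by the L2 norm of that feature's decoder column (the coefficient value is illustrative):

```python
import torch


def scaled_sae_loss(x, x_hat, f, decoder_weight, l1_coeff=5.0):
    """Reconstruction error plus a sparsity penalty weighted by decoder column norms."""
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    decoder_norms = decoder_weight.norm(dim=0)          # (d_feat,) L2 norm of each decoder column
    sparsity = (f * decoder_norms).sum(dim=-1).mean()   # f is non-negative (post-ReLU)
    return recon + l1_coeff * sparsity
```

Weighting the penalty by the decoder norm makes it invariant to rescaling a feature's decoder column while shrinking its activation, which is why the decoder columns no longer need to be constrained to unit norm.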
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model. (For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of the paper, and the finetuned model rather than the base pretrained model, although the method also works on the base model.)
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#assessing-tour-influence
Korean
alignment (emnlp main 2024)

Seonglae Cho