Monosemanticity

Creator
Seonglae Cho
Created
2023 Nov 29 11:59
Edited
2024 Oct 16 15:24

Mono-semanticity

Concretely, Anthropic takes a one-layer transformer with a 512-neuron MLP layer and decomposes the MLP activations into relatively interpretable features by training sparse autoencoders on MLP activations from 8 billion data points, with expansion factors ranging from 1× (512 features) to 256× (131,072 features). The detailed interpretability analyses focus on the 4,096 features learned in one run called A/1.
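The relation between expansion factor and dictionary size above is plain multiplication by the MLP width. A minimal sketch, where the helper names `dictionary_size` and `implied_expansion` are made up for illustration:

```python
N_NEURONS = 512  # MLP width stated in the note


def dictionary_size(expansion_factor):
    """Number of autoencoder features for a given expansion factor."""
    return N_NEURONS * expansion_factor


def implied_expansion(n_features):
    """Expansion factor implied by a dictionary size."""
    return n_features / N_NEURONS


smallest = dictionary_size(1)
largest = dictionary_size(256)
a1_factor = implied_expansion(4096)  # the A/1 run has 4,096 features
```

By this arithmetic the A/1 run corresponds to an 8× expansion of the 512-neuron layer.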
In the A/1 run, 3,928 of the 4,096 features were successfully classified, and the remaining 168 were judged to be dead neurons.
We should expect the feature directions to form an overcomplete basis; that is, the decomposition should have more directions than neurons. Moreover, the feature activations should be sparse, because sparsity is what enables this kind of noisy simulation. This is mathematically identical to the classic problem of dictionary learning.
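To make the live/dead split concrete, here is a rough sketch that counts dictionary features which never fire on a sample of inputs. The function name `count_dead_features` and the toy activations are hypothetical, and treating "dead" as "zero activation on every sampled input" is an assumption, not necessarily the exact criterion used in the original work.

```python
def count_dead_features(activations):
    """Count features that never activate.

    activations: list of rows, one per input, each row holding the
    activation of every dictionary feature on that input.
    Returns (n_dead, n_alive).
    """
    n_features = len(activations[0])
    fired = [False] * n_features
    for row in activations:
        for i, a in enumerate(row):
            if a > 0:
                fired[i] = True
    n_dead = fired.count(False)
    return n_dead, n_features - n_dead


# Toy data: 3 inputs, 4 features; features 2 and 3 never fire.
toy_activations = [
    [0.0, 1.2, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0],
]
n_dead, n_alive = count_dead_features(toy_activations)
```

The rows are also sparse in the sense described above: most entries are zero on any given input.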
Specific configurations for Transformer and Sparse Autoencoder used for experiment
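A Sparse AutoEncoder is used to decompose the transformer model's neuron activations.
The training objective is an L2 reconstruction loss plus an L1 penalty on the hidden activation layer. Focusing on the MLP part of the transformer, the autoencoder is trained with the MLP layer activations as both its input and its reconstruction target. In this way, only the information the model treats as important is emphasized, which yields interpretable features.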
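The objective described above can be sketched for a single activation vector as follows. This is a generic ReLU autoencoder written in plain Python, not the exact architecture from the paper; the names `sae_loss`, `W_enc`, `W_dec`, `b_enc`, `b_dec`, and `l1_coeff` are assumptions chosen for illustration.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (loss, features, reconstruction) for one MLP activation vector x."""
    # Encoder: feature activations f = ReLU(W_enc x + b_enc)
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    f = [max(0.0, p) for p in pre]
    # Decoder: reconstruction x_hat = W_dec f + b_dec
    x_hat = [a + b for a, b in zip(matvec(W_dec, f), b_dec)]
    # L2 reconstruction error plus L1 penalty on the hidden activations
    l2 = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    l1 = sum(abs(a) for a in f)
    return l2 + l1_coeff * l1, f, x_hat


# Toy overcomplete setup: 2 "neurons", 3 dictionary features.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]
loss, f, x_hat = sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=0.5)
```

In the toy setup only the first feature fires, so the L1 term is small while the second coordinate of `x` is left unreconstructed; training would trade these two terms off against each other.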
The sparse architectural approach (approach 1) was insufficient to prevent polysemanticity, and standard dictionary learning methods (approach 2) had significant issues with overfitting.

The Pile

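After a Sparse AutoEncoder is used to make each neuron's role more pronounced, a human inspects the inputs on which each one activates most strongly and labels what role it plays. The training did not explicitly aim for monosemanticity; rather, once trained, the neurons turned out to be monosemantic. Sparse autoencoders produce interpretable features that are effectively invisible in the neuron basis.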
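The labeling step starts from the strongest activations of a feature. A minimal sketch of collecting those examples, assuming per-token activations are already available; the function name `top_activating_tokens` and the toy tokens are hypothetical:

```python
import heapq


def top_activating_tokens(tokens, feature_acts, k=3):
    """Return the k (activation, token) pairs with the largest activation."""
    return heapq.nlargest(k, zip(feature_acts, tokens))


# Toy example: a feature that fires on base64-looking strings.
tokens = ["the", "aGVsbG8", "cat", "d29ybGQ", "sat"]
feature_acts = [0.0, 4.1, 0.2, 3.7, 0.0]
top = top_activating_tokens(tokens, feature_acts, k=2)
```

A person would then read the returned examples and write a short label for the feature.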
Neuron Activation for each topic
Sparse autoencoder features can be used to intervene on and steer transformer generation, which suggests a new approach to Prompt Guardrail and AI Alignment. Approaches that focus on decoding strategies, prompt engineering, or RL training are bound to have flaws, so future services may well achieve stable behavior by controlling neuron-level features in this way. Because this kind of risk weighs more heavily on larger companies, it is probably an area that the top two companies, OpenAI and Anthropic, are concentrating on.
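One simple way to picture feature-level steering is to push the MLP activation along a feature's decoder direction before the forward pass continues. The note does not say how the intervention is actually performed, so this is only a simplified sketch; `steer`, `feature_direction`, and `strength` are hypothetical names.

```python
def steer(activation, feature_direction, strength):
    """Shift an activation vector along one feature direction."""
    return [a + strength * d for a, d in zip(activation, feature_direction)]


# Toy example: a 3-dimensional activation and a feature aligned with axis 1.
activation = [0.5, -1.0, 2.0]
feature_direction = [0.0, 1.0, 0.0]
steered = steer(activation, feature_direction, strength=3.0)
```

A positive `strength` amplifies the feature and a negative one suppresses it, which is the kind of control a guardrail would need.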
 
 
 
 

Feature splitting

  • For example, one base64 feature in a small dictionary splits into three features, each with a more subtle yet still interpretable role, in a larger dictionary. Despite the MLP layer being very small, we continue to find new features as we scale the sparse autoencoder.

Universality

  • Sparse autoencoders produce relatively universal features. Sparse autoencoders applied to different transformer language models produce mostly similar features, more similar to one another than they are to their own model's neurons. 
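One way to check this kind of claim is to run two models on the same tokens and look for the unit in the second model whose activations track a given feature most closely. The sketch below uses Pearson correlation as the similarity measure, which is an assumption; `pearson`, `best_match`, and the toy activations are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length activation sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def best_match(feature_acts, candidates):
    """Return (name, score) of the candidate most correlated with feature_acts."""
    scores = {name: pearson(feature_acts, acts) for name, acts in candidates.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Activations of one feature from model A on four shared tokens,
# compared against a feature and a raw neuron from model B.
feature_a = [0.0, 2.0, 0.0, 4.0]
model_b_units = {
    "B/feature_7": [0.0, 1.0, 0.0, 2.0],
    "B/neuron_3": [1.0, 1.0, 0.0, 1.0],
}
match_name, match_score = best_match(feature_a, model_b_units)
```

In the toy data the feature from model B matches better than the neuron does, mirroring the pattern described in the bullet.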

FSM

SUPERPOSITION HYPOTHESIS
The most interesting part: for example, we find features that work together to generate valid HTML.
decomposition also takes a similar strategy.
Positional Embedding also takes a similar strategy.
 
Naively, simulating a layer with 100k features would be 100,000 times more expensive than sampling a large language model such as Claude 2 or GPT-4 (we'd need to sample the explaining large model once for every feature at every token). At current prices, this would suggest simulating 100k features on a single 4096 token context would cost $12,500–$25,000, and one would presumably need to evaluate over many contexts.
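The quoted range can be unpacked with a little arithmetic. Assuming the total is simply the number of features times the cost of one pass of the explaining model over the 4096-token context (which is what "100,000 times more expensive" implies), the per-pass price follows directly. The helper name `implied_cost_per_context` is made up for illustration.

```python
N_FEATURES = 100_000
CONTEXT_TOKENS = 4096
TOTAL_COST_RANGE = (12_500, 25_000)  # dollars, as quoted in the note


def implied_cost_per_context(total_cost, n_features=N_FEATURES):
    """Cost of one pass over the context, assuming total = n_features * per-pass cost."""
    return total_cost / n_features


per_context = [implied_cost_per_context(c) for c in TOTAL_COST_RANGE]
per_1k_tokens = [c / (CONTEXT_TOKENS / 1000) for c in per_context]
```

So the estimate corresponds to roughly $0.125 to $0.25 per pass over a single 4096-token context, before multiplying by the many contexts one would need to evaluate.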
 
 
 

Neurons

 
 
The distance between features reflects how similar they are. In other words, the model holds a map of concepts in which related concepts sit as close to each other as their associated features do.
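The note does not say which distance is meant; cosine similarity between feature direction vectors is one common choice and is assumed here. The function name `cosine_similarity` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy directions: `related` points partly the same way as `base`,
# `opposite` points the other way.
base = [1.0, 0.0]
related = [1.0, 1.0]
opposite = [-1.0, 0.0]
sim_related = cosine_similarity(base, related)
sim_opposite = cosine_similarity(base, opposite)
```

Under this measure, features for related concepts score closer to 1 and unrelated or opposed ones score lower.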
 
 

Short

Long

Dictionary

2024
Claude 3 Sonnet

 
 
