SAE Feature

Creator: Seonglae Cho
Created: 2025 Jan 8 20:24
Edited: 2025 Mar 27 22:31
Refs

SAE Latent Unit

SAE enables feature disentanglement from Superposition
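A minimal sketch of the idea, assuming a standard ReLU sparse autoencoder trained with an L1 sparsity penalty; the dimensions and coefficient below are illustrative assumptions, not values from any specific paper.

```python
# Minimal sketch of a sparse autoencoder (SAE) that decomposes model
# activations into sparse, potentially monosemantic features.
# d_model, dict_size, and l1_coeff are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, dict_size: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)

    def forward(self, x: torch.Tensor):
        # Feature activations: sparse, non-negative codes over the dictionary.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity,
    # pushing each dictionary direction toward representing a single concept.
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```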

Responds to both abstract discussions and concrete examples
Given that there are features representing more abstract properties of the input, might there also be more abstract, higher-level actions that trigger behaviors over the span of multiple tokens?
SAE Feature Notion
SAE Feature Metrics
Insane visualization of feature activation distribution with conditional distribution and log-likelihood
https://transformer-circuits.pub/2023/monosemantic-feature

What is a Feature: the simplest factorization of the activations (SAE MDL)

As investigation of superposition has progressed, it has become clear that we don't really know what a "feature" is, despite features being central to the research agenda. Given a sparse factorization of the activations into a dictionary matrix and a sparse coefficient matrix, start by defining "total information" as the entropy of a probability distribution fit to the entries of those matrices. (Larger dictionaries require more information to represent, but sparser codes require less, which may counterbalance.) Measuring this total information turns out to be an effective tool for determining the "correct number of features", at least on synthetic data: Anthropic observes that dictionary-learning solutions "bounce" when the dictionary size matches the true number of factors, which is also where the best MMCS (mean max cosine similarity) score occurs. If such bounces could be found in real data, it would be significant evidence that there are "real features" to be found.
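A hedged sketch of the total-information heuristic, assuming a simple histogram stands in for "fitting a probability distribution to the entries"; the binning scheme, bin count, and the hypothetical `fit_sae` helper are assumptions, not Anthropic's exact recipe.

```python
# MDL-style description length: entropy of a distribution fit to the
# entries of a factor matrix, times the number of entries.
import numpy as np

def total_information(matrix: np.ndarray, n_bins: int = 64) -> float:
    """Bits needed to describe the matrix under a histogram model of
    its entries (entropy in bits per entry * number of entries)."""
    entries = matrix.ravel()
    counts, _ = np.histogram(entries, bins=n_bins)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    bits_per_entry = -(probs * np.log2(probs)).sum()
    return bits_per_entry * entries.size

# Usage sketch: score a factorization activations ~ codes @ dictionary
# across dictionary sizes and look for the "bounce".
# codes, dictionary = fit_sae(activations, dict_size)  # hypothetical helper
# info = total_information(codes) + total_information(dictionary)
```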
High Activation Density could mean either that sparsity was not properly learned, or that the feature is genuinely needed in many situations. In the Feature Browser, SAE features show higher interpretability when more of their activations fall in high Quantiles, which points to a limitation: SAE features tend to have low interpretability at low activations and exhibit a certain skewness. However, the features with the highest Activation Density in the Activation Distribution are less interpretable, mainly because they typically lack high activation values in absolute terms (not by quantile). A well-classified, highly interpretable SAE feature should not show density that simply decreases with activation value; rather, after an initial decrease, it should show clustering at high activation levels.
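A small sketch of the diagnostics above: activation density (how often a feature fires) and the shape of its nonzero activation distribution. The fire-above-zero convention and bin count are assumptions.

```python
# Per-feature activation diagnostics for an SAE feature.
import numpy as np

def activation_stats(feature_acts: np.ndarray, n_bins: int = 50):
    """feature_acts: activations of one feature over many tokens."""
    nonzero = feature_acts[feature_acts > 0]
    density = nonzero.size / feature_acts.size  # fraction of tokens that fire
    hist, edges = np.histogram(nonzero, bins=n_bins)
    # Heuristic from the discussion above: an interpretable feature tends to
    # show a secondary mode at high activation values, rather than density
    # that only decays toward zero.
    return density, hist, edges
```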
Automated Interpretability: metrics proposal from EleutherAI
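A rough sketch in the spirit of a detection-style automated interpretability score: a judge model reads a feature explanation and predicts whether each text example activates the feature, scored against ground truth. The `judge` callable is hypothetical, not a real library API, and this is one plausible reading of such metrics rather than EleutherAI's exact procedure.

```python
# Detection-style auto-interp score: agreement between a judge model's
# predictions (from the explanation alone) and ground-truth firings.
from typing import Callable, Sequence

def detection_score(
    explanation: str,
    examples: Sequence[str],
    labels: Sequence[bool],             # ground truth: did the feature fire?
    judge: Callable[[str, str], bool],  # hypothetical: (explanation, text) -> fired?
) -> float:
    correct = sum(
        judge(explanation, text) == label
        for text, label in zip(examples, labels)
    )
    return correct / len(examples)
```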
 
 

Recommendations