SAE Feature Universality

Creator: Seonglae Cho
Created: 2025 Jan 30 1:48
Edited: 2025 Feb 21 23:46

Activation similarity

One natural approach is to think of a feature as a function assigning values to datapoints; two features would be similar in this sense if they take similar values over a diverse set of data.  In practice, this can be approximated by representing the feature as a vector, with indices corresponding to a fixed set of data points. We call the correlations between these vectors the activation similarity between features.
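
A minimal sketch of this, assuming feature activations over a shared set of data points have already been stacked into matrices (names are illustrative):

```python
import numpy as np

def activation_similarity(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Pearson correlation between every pair of features.

    acts_a: (n_datapoints, n_features_a) feature activations from model A
    acts_b: (n_datapoints, n_features_b) feature activations from model B
    Returns an (n_features_a, n_features_b) correlation matrix.
    """
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    a /= np.linalg.norm(a, axis=0) + 1e-8  # guard against dead features
    b /= np.linalg.norm(b, axis=0) + 1e-8
    return a.T @ b  # centered, normalized dot product = Pearson correlation
```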

Logit weight similarity

A second natural approach is to think of a feature in terms of its downstream effects; two features would be similar in this sense if their activation changes the model's predictions in similar ways. In our one-layer model, a simple approximation to this is the logit weights. This approximation represents each feature as a vector with indices corresponding to vocabulary tokens. We call the correlations between these vectors the logit weight similarity between features.
https://transformer-circuits.pub/2023/monosemantic-features#phenomenology-universality
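
A hedged sketch of this measure, shown here for two dictionaries decoded through the same unembedding matrix; comparing across models would need one unembedding per model and a shared vocabulary:

```python
import numpy as np

def logit_weight_similarity(dec_a: np.ndarray, dec_b: np.ndarray,
                            W_U: np.ndarray) -> np.ndarray:
    """Correlate features by their direct effect on the vocabulary logits.

    dec_a, dec_b: (n_features, d_model) SAE decoder directions
    W_U:          (d_model, n_vocab) unembedding matrix
    Returns an (n_features_a, n_features_b) correlation matrix.
    """
    logits_a = dec_a @ W_U  # (n_features_a, n_vocab) per-feature logit weights
    logits_b = dec_b @ W_U
    a = logits_a - logits_a.mean(axis=1, keepdims=True)
    b = logits_b - logits_b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True) + 1e-8
    b /= np.linalg.norm(b, axis=1, keepdims=True) + 1e-8
    return a @ b.T
```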
Because logit weight correlation captures a feature's actual effect on predictions poorly, attribution similarity is used instead.

Attribution similarity

We want to measure something more like "the actual effect a feature has on token probabilities." One way to get at this would be to compute a vector of ablation effects for every feature on every data point; unfortunately, this would be rather computationally expensive. Instead, we scale the activation vector of a feature by the logit weights of the tokens that empirically come next in the dataset to produce an attribution vector.
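
A sketch of this attribution vector, reusing the per-feature logit weights from above (helper names are illustrative):

```python
import numpy as np

def attribution_vectors(acts: np.ndarray, logit_weights: np.ndarray,
                        next_tokens: np.ndarray) -> np.ndarray:
    """Cheap stand-in for per-datapoint ablation effects.

    acts:          (n_datapoints, n_features) feature activations
    logit_weights: (n_features, n_vocab) per-feature logit weights (decoder @ W_U)
    next_tokens:   (n_datapoints,) ids of the tokens that actually come next
    Returns (n_datapoints, n_features) attribution values.
    """
    # Logit weight of the empirical next token, per (datapoint, feature) pair
    w = logit_weights[:, next_tokens].T  # (n_datapoints, n_features)
    return acts * w
```

Attribution similarity is then the correlation between columns of these matrices, exactly as in the activation-similarity sketch above.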
 
 
One early approach was to engineer models to simply not have superposition in the first place. Unfortunately, having spent a significant amount of time investigating this approach, we have ultimately concluded that it is fundamentally non-viable.
Do divergent models learn the same features? Anthropic observes substantial universality, and at a high level this makes sense: if a feature is useful to one model in representing the dataset, it's likely useful to others, and if two models represent the same feature, then a good dictionary learning algorithm should find it.
After matching SAE features across models via activation correlation, they apply representational-space similarity metrics such as Singular Value Canonical Correlation Analysis (SVCCA). Their experiments reveal similarities in SAE feature spaces across various LLMs, providing evidence for feature universality.
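
A compact sketch of SVCCA over two activation matrices whose rows are aligned on the same data points; it relies on the fact that CCA between orthonormal bases reduces to an SVD of their inner-product matrix:

```python
import numpy as np

def svcca(acts_a: np.ndarray, acts_b: np.ndarray, k: int = 20) -> float:
    """Singular Value Canonical Correlation Analysis between two
    (n_datapoints, n_features) activation matrices.
    Returns the mean canonical correlation over the top-k directions.
    """
    A = acts_a - acts_a.mean(axis=0)
    B = acts_b - acts_b.mean(axis=0)
    # SV step: keep the top-k singular directions of each space
    Ua, _, _ = np.linalg.svd(A, full_matrices=False)
    Ub, _, _ = np.linalg.svd(B, full_matrices=False)
    Ua, Ub = Ua[:, :k], Ub[:, :k]
    # CCA step: since Ua and Ub have orthonormal columns, the canonical
    # correlations are the singular values of their inner product
    rho = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return float(rho.mean())
```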

Depends on seed and dataset (SAE Training)

Weight Cosine Similarity
Orphan features still show high interpretability, which indicates that a different seed may have found a different subset of the "idealized dictionary size" (a matching sketch follows below the list).
  • Dataset matters more than seed.
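
A minimal sketch of this comparison, matching features across two SAE training seeds by decoder-weight cosine similarity; the threshold for calling a feature an "orphan" is a free choice:

```python
import numpy as np

def best_decoder_match(dec_a: np.ndarray, dec_b: np.ndarray,
                       orphan_threshold: float = 0.7):
    """Match features across two SAE seeds by decoder cosine similarity.

    dec_a, dec_b: (n_features, d_model) decoder weight matrices
    Returns (best similarity per seed-A feature, boolean orphan mask).
    """
    a = dec_a / (np.linalg.norm(dec_a, axis=1, keepdims=True) + 1e-8)
    b = dec_b / (np.linalg.norm(dec_b, axis=1, keepdims=True) + 1e-8)
    sims = a @ b.T              # (n_features_a, n_features_b) cosines
    best = sims.max(axis=1)     # closest seed-B feature for each seed-A one
    return best, best < orphan_threshold  # orphans: no close counterpart
```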
 
 
 

Recommendations