Activation similarity
One natural approach is to think of a feature as a function assigning values to datapoints; two features would be similar in this sense if they take similar values over a diverse set of data. In practice, this can be approximated by representing the feature as a vector, with indices corresponding to a fixed set of data points. We call the correlations between these vectors the activation similarity between features.
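As a minimal sketch (assuming a precomputed activation matrix `acts` of shape `[n_features, n_datapoints]`; the name is illustrative):

```python
import numpy as np

def activation_similarity(acts: np.ndarray) -> np.ndarray:
    """Pairwise Pearson correlation between feature activation vectors.

    acts: [n_features, n_datapoints]; acts[i, j] is the activation of
    feature i on datapoint j over a fixed, diverse set of data points.
    """
    # np.corrcoef treats each row as one variable observed over the
    # columns, so this returns an [n_features, n_features] matrix.
    return np.corrcoef(acts)
```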
Logit weight similarity
A second natural approach is to think of a feature in terms of its downstream effects; two features would be similar in this sense if their activations change the model's predictions in similar ways. In our one-layer model, a simple approximation to this is the logit weights. This approximation represents each feature as a vector with indices corresponding to vocabulary tokens. We call the correlations between these vectors the logit weight similarity between features.
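In code, this might look like the following sketch, where `W_dec` (`[n_features, d_model]`) holds each feature's decoder direction and `W_U` (`[d_model, n_vocab]`) is the model's unembedding matrix; both names are assumptions, not notation from the source:

```python
import numpy as np

def logit_weight_similarity(W_dec: np.ndarray, W_U: np.ndarray) -> np.ndarray:
    """Correlate the logit-weight vectors of all features.

    W_dec: [n_features, d_model] dictionary decoder directions.
    W_U:   [d_model, n_vocab] unembedding of the one-layer model.
    """
    logit_weights = W_dec @ W_U        # [n_features, n_vocab]
    return np.corrcoef(logit_weights)  # [n_features, n_features]
```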
However, logit weight correlation is a poor proxy for a feature's effect on actual predictions.
Attribution similarity
We want to measure something more like "the actual effect a feature has on token probabilities." One way to get at this would be to compute a vector of ablation effects for every feature on every data point; unfortunately, this would be computationally expensive. Instead, we scale each feature's activation vector by the logit weights of the tokens that empirically come next in the dataset, producing an attribution vector.
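A sketch of this construction, reusing `acts` and `logit_weights` from the sketches above and assuming `next_tokens` holds the id of the token that actually follows each datapoint:

```python
import numpy as np

def attribution_similarity(acts, logit_weights, next_tokens):
    """Correlate attribution vectors across features.

    acts:          [n_features, n_datapoints] feature activations.
    logit_weights: [n_features, n_vocab] per-feature logit weights.
    next_tokens:   [n_datapoints] ids of the empirically-next tokens.
    """
    # Scale each activation by the feature's logit weight on the token
    # that actually came next, approximating its realized effect.
    attributions = acts * logit_weights[:, next_tokens]
    return np.corrcoef(attributions)  # [n_features, n_features]
```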
One of those approaches was to engineer models to simply not have superposition in the first place. Unfortunately, having spent a significant amount of time investigating this approach, we have ultimately concluded that it is fundamentally non-viable.
Do divergent models learn the same features? Anthropic observes substantial universality, and at a high level this makes sense: if a feature is useful to one model in representing the dataset, it is likely useful to others; and if two models represent the same feature, a good dictionary learning algorithm should find it in both.
After matching SAE features across models via activation correlation, they apply representational-similarity metrics such as Singular Vector Canonical Correlation Analysis (SVCCA). Their experiments reveal similarities in SAE feature spaces across various LLMs, providing evidence for feature universality.
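A rough sketch of the SVCCA step (assuming `X` and `Y` are activation matrices of already-matched features from two models; the SVD truncation and component count are simplifications, not the paper's exact setup):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca_score(X: np.ndarray, Y: np.ndarray, n_components: int = 20) -> float:
    """SVCCA: SVD to denoise each space, then CCA on the reduced views.

    X, Y: [n_datapoints, n_features] activations of matched features
    from two models. Returns the mean canonical correlation.
    """
    # Keep the top left-singular directions of each (centered) space.
    X_r = np.linalg.svd(X - X.mean(0), full_matrices=False)[0][:, :n_components]
    Y_r = np.linalg.svd(Y - Y.mean(0), full_matrices=False)[0][:, :n_components]
    # Canonically correlate the two reduced subspaces.
    X_c, Y_c = CCA(n_components=n_components).fit_transform(X_r, Y_r)
    corrs = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))
```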
SAE training depends on seed and dataset
Weight Cosine Similarity
Orphan features still show high interpretability, which suggests that different seeds may each recover a different subset of some larger "idealized" dictionary.
- seed: weight initialization matters → initializing the SAE at the geometric median of the dataset helps mitigate this issue
- Weight cosine similarity + Hungarian Matching
Use 1 − cosine similarity as the cost matrix, then apply the Hungarian algorithm to find the optimal 1:1 matching between the two runs' features (see the sketch after this list).
- dataset: matters more than seed
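A minimal sketch of that matching step, using `scipy.optimize.linear_sum_assignment` (a Hungarian-style solver) on the decoder weights of two runs; the names `W_dec_a`/`W_dec_b` are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_features(W_dec_a: np.ndarray, W_dec_b: np.ndarray):
    """Optimally 1:1-match two SAEs' features by decoder cosine similarity.

    W_dec_a, W_dec_b: [n_features, d_model] decoder matrices from two
    training runs. Returns (rows, cols, matched cosine similarities).
    """
    # Normalize rows so the dot product equals cosine similarity.
    a = W_dec_a / np.linalg.norm(W_dec_a, axis=1, keepdims=True)
    b = W_dec_b / np.linalg.norm(W_dec_b, axis=1, keepdims=True)
    sim = a @ b.T                                  # [n_a, n_b]
    rows, cols = linear_sum_assignment(1.0 - sim)  # minimize total cost
    return rows, cols, sim[rows, cols]
```

Features whose matched similarity remains low under this assignment are the "orphan" features discussed above.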