SAE feature bimodality

Creator

Created

2025 Jan 28 17:3

Editor

Edited

2025 Mar 27 21:41

Refs

When MCS is computed between features taken from different dictionaries, the resulting distribution of values has two peaks:

One lies where MCS values are close to 1 (features that have a near-identical counterpart).
The other lies where MCS values are around 0.3 (features that are dissimilar to every counterpart and appear random).

This phenomenon is called bimodality, and each peak likely indicates the following:

Features with high MCS values: These are likely meaningful "real" features that are found repeatedly across multiple dictionaries.
Features with low MCS values: These are likely non-meaningful features that look like random noise or dead neurons.

*MCS (maximum cosine similarity)
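
As a rough illustration of how such a distribution can be computed, here is a minimal sketch in Python with NumPy. It assumes each dictionary is a matrix with one feature direction per row. The function name `max_cosine_similarity`, the toy sizes, and the synthetic data are all assumptions made for this example, not taken from the referenced posts. The toy dictionaries share half of their features (up to small noise) and fill the rest with random directions, so the histogram of MCS values shows two separate clusters.

```python
import numpy as np


def max_cosine_similarity(dict_a, dict_b):
    """For each feature (row) in dict_a, return its highest cosine
    similarity with any feature (row) in dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    # (n_a, n_b) matrix of pairwise cosine similarities, reduced over dict_b.
    return (a @ b.T).max(axis=1)


# Synthetic stand-ins for two learned dictionaries (assumption: toy data only).
rng = np.random.default_rng(0)
d_model, n_shared, n_noise = 64, 50, 50

# Directions present in both dictionaries, up to a small perturbation.
shared = rng.normal(size=(n_shared, d_model))
dict_a = np.vstack([shared, rng.normal(size=(n_noise, d_model))])
dict_b = np.vstack([
    shared + 0.05 * rng.normal(size=shared.shape),
    rng.normal(size=(n_noise, d_model)),
])

mcs = max_cosine_similarity(dict_a, dict_b)

# Bin the MCS values; a bimodal distribution shows up as two separated clusters.
counts, edges = np.histogram(mcs, bins=10, range=(0.0, 1.0))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.1f}-{hi:.1f}: {c}")
```

With real dictionaries, `dict_a` and `dict_b` would be replaced by the learned feature matrices. Where the low peak falls depends on the activation dimension and the dictionary size, so the toy data is not expected to reproduce the exact values quoted above.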

[Research Update] Sparse Autoencoder features are bimodal

https://aizi.substack.com/p/research-update-sparse-autoencoder

The ultralow density cluster appears to be an artifact of the autoencoder training process and not a real property of the underlying transformer

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.

https://transformer-circuits.pub/2023/monosemantic-features#appendix-feature-density

Backlinks

SAE Limitation Logit Interpretability
