- When computing MCS between features extracted from separately trained dictionaries, the resulting distribution has two peaks:
    - One peak lies in the region where MCS is close to 1 (features with a near-identical counterpart in the other dictionary).
    - The other lies around MCS ≈ 0.3 (features with no close match, which look like random directions).
- This phenomenon is called bimodality, and each peak likely indicates the following:
    - Features with high MCS: likely meaningful "real" features that are found repeatedly across multiple dictionaries.
    - Features with low MCS: likely non-meaningful features that behave like random noise or dead neurons.
*MCS (max cosine similarity): for each feature in one dictionary, its highest cosine similarity against every feature in the other dictionary (see the sketch below)
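As a minimal sketch (hypothetical code, not taken from either source), MCS can be computed by L2-normalizing the feature directions of each dictionary and taking the best match per feature; `dict_a`/`dict_b` are placeholder names here:

```python
import numpy as np

def max_cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """For each feature (row) of `a`, return its highest cosine
    similarity against all features (rows) of `b`."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a_unit @ b_unit.T).max(axis=1)  # best match per row of `a`

# Placeholder data standing in for the decoder weights of two SAEs
# trained with different seeds (shape: n_features x d_model).
rng = np.random.default_rng(0)
dict_a = rng.normal(size=(4096, 512))
dict_b = rng.normal(size=(4096, 512))

mcs = max_cosine_similarity(dict_a, dict_b)
# With real SAE dictionaries in place of the random placeholders, a
# histogram of `mcs` would show the two peaks described above:
# one near 1 (shared features) and one near ~0.3 (noise-like).
```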
[Research Update] Sparse Autoencoder features are bimodal
https://aizi.substack.com/p/research-update-sparse-autoencoder
The ultralow density cluster appears to be an artifact of the autoencoder training process and not a real property of the underlying transformer
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
https://transformer-circuits.pub/2023/monosemantic-features#appendix-feature-density
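The "ultralow density cluster" above refers to features that activate extremely rarely. As a rough sketch of what feature density means (hypothetical names; the `sae.encode` call is an assumed API, not the paper's code):

```python
import numpy as np

def feature_density(activations: np.ndarray, eps: float = 0.0) -> np.ndarray:
    """Fraction of tokens on which each feature fires.

    activations: (n_tokens, n_features) SAE feature activations.
    """
    return (activations > eps).mean(axis=0)

# Hypothetical usage: the ultralow density cluster consists of
# features firing on a tiny fraction of tokens (very low log density).
# acts = sae.encode(model_activations)  # assumed API
# log_density = np.log10(feature_density(acts) + 1e-10)
```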

Seonglae Cho