SAE feature bimodality

Creator
Seonglae Cho
Created
2025 Jan 28 17:3
Editor
Edited
2025 Feb 1 23:31
Refs
  • When calculating MCS between features extracted from specific dictionaries, the resulting distribution has two peaks:
    • One peak lies where MCS values are close to 1 (features that have a very similar counterpart).
    • The other peak lies where MCS values are around 0.3 (features that are dissimilar to every counterpart and appear random).
  • This phenomenon is called bimodality, and each peak likely indicates the following:
    • Features with high MCS values: likely meaningful "real" features that are found repeatedly across multiple dictionaries.
    • Features with low MCS values: likely non-meaningful features that resemble random noise or dead neurons.
*MCS (maximum cosine similarity)
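
As a rough illustration of how such a two-peaked distribution can arise, here is a minimal NumPy sketch. It assumes each dictionary is a matrix with one feature direction per row and takes MCS to be, for each feature in one dictionary, the largest cosine similarity to any feature in the other. The function name `max_cosine_similarity` and the synthetic dictionaries (a shared set of directions plus independent random ones) are hypothetical stand-ins, not data from real SAEs.

```python
import numpy as np


def max_cosine_similarity(dict_a: np.ndarray, dict_b: np.ndarray) -> np.ndarray:
    """For each row (feature) of dict_a, return its maximum cosine
    similarity to any row of dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)


# Synthetic stand-ins (assumption): both dictionaries contain slightly
# perturbed copies of the same "shared" directions, plus unrelated
# random directions that have no counterpart in the other dictionary.
rng = np.random.default_rng(0)
d_model, n_shared, n_noise = 64, 200, 200
shared = rng.normal(size=(n_shared, d_model))

dict_a = np.vstack([
    shared + 0.01 * rng.normal(size=shared.shape),
    rng.normal(size=(n_noise, d_model)),
])
dict_b = np.vstack([
    shared + 0.01 * rng.normal(size=shared.shape),
    rng.normal(size=(n_noise, d_model)),
])

mcs = max_cosine_similarity(dict_a, dict_b)

# A coarse histogram over [0, 1] makes the two clusters visible.
counts, edges = np.histogram(mcs, bins=10, range=(0.0, 1.0))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.1f}-{hi:.1f}: {c}")
```

In this toy setup the shared directions land in the cluster near 1, while the unmatched random directions land in a much lower cluster whose position depends on the dimensionality and dictionary size. Where the low peak sits for real dictionaries is an empirical question that this sketch does not settle.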
 
 
 
The ultralow density cluster appears to be an artifact of the autoencoder training process and not a real property of the underlying transformer.
 
