Sparse Autoencoders can learn meaningful features with relatively little data compared to foundation model pretraining.
Datasets: The Pile, OpenWebText, TinyStories
SAE training: the learned features depend on both the random seed and the dataset.
Weight Cosine Similarity
Orphan features still show high interpretability, which suggests that different seeds may each have recovered a different subset of the "idealized dictionary".
- Seed: weight initialization matters → controlling the SAE weight initialization helps prevent this issue
- Weight cosine similarity + Hungarian Matching
The cost matrix is set to 1 − (the cosine similarity matrix), and Hungarian matching is applied to find the optimal 1:1 feature matching (a minimal sketch appears below).
- Dataset: matters more than the seed
Many features in the model have not yet been discovered. Whether a feature appears in the SAE dictionary is closely linked to how frequently the corresponding concept occurs in the training dataset.
For example, chemical concepts that are frequently mentioned in the training data almost always have features in the dictionary, while concepts that are rarely mentioned often lack features.
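A minimal sketch of the seed-comparison procedure described above, using decoder weight cosine similarity plus Hungarian matching. The variable names, decoder weight shape `(n_features, d_model)`, and the 0.7 similarity threshold are illustrative assumptions, not values from the text:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_features(dec_a: np.ndarray, dec_b: np.ndarray, threshold: float = 0.7):
    """Match features of two SAEs (e.g., trained with different seeds) 1:1.

    dec_a, dec_b: decoder weight matrices of shape (n_features, d_model).
    """
    # Normalize each decoder row so dot products become cosine similarities.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    cos_sim = a @ b.T  # shape (n_features_a, n_features_b)

    # Cost matrix = 1 - cosine similarity; Hungarian matching finds the
    # 1:1 assignment that minimizes the total cost.
    cost = 1.0 - cos_sim
    rows, cols = linear_sum_assignment(cost)
    matched_sims = cos_sim[rows, cols]

    # Pairs above the threshold count as shared features; features whose best
    # 1:1 partner is still dissimilar are "orphan" features.
    shared = [(i, j) for i, j, s in zip(rows, cols, matched_sims) if s >= threshold]
    orphans_a = [i for i, s in zip(rows, matched_sims) if s < threshold]
    return shared, orphans_a, matched_sims
```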
Dataset dependent
An SAE trained on a chat-specific dataset (LmSys-Chat-1M) reconstructs the "refusal direction" much more sparsely and interpretably than an SAE trained on the Pile dataset.
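A rough sketch of how this sparsity comparison could be measured, assuming a standard ReLU SAE parameterization (encoder weights `W_enc`, encoder bias `b_enc`, decoder bias `b_dec`) and a precomputed `refusal_dir` vector in the model's residual stream; all names here are assumptions, not the original setup:

```python
import numpy as np

def active_features(sae: dict, direction: np.ndarray, eps: float = 1e-6) -> int:
    """Count how many SAE features fire when encoding a single direction.

    sae["W_enc"]: (n_features, d_model), sae["b_enc"]: (n_features,),
    sae["b_dec"]: (d_model,); direction: (d_model,).
    """
    acts = np.maximum(0.0, sae["W_enc"] @ (direction - sae["b_dec"]) + sae["b_enc"])
    return int((acts > eps).sum())

# A sparser, more interpretable reconstruction means fewer active features:
# n_chat = active_features(sae_chat, refusal_dir)   # SAE trained on LmSys-Chat-1M
# n_pile = active_features(sae_pile, refusal_dir)   # SAE trained on the Pile
# Expectation from the note above: n_chat is much smaller than n_pile.
```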