Random Weight Initialization
Strong inductive bias inherent in the model architecture
Models trained with only the embedding layer work only for simple pattern matching.
SAEs discover only single-token features.
Are the SAE results really meaningful, or are they a property of the sparse text dataset (i.e., the reason only single-token features are discovered)? I suspect the results are driven more by the structure of the transformer itself than by superposition in the token embeddings, since the comparison between random and intact token embeddings showed similar outcomes. This makes me curious how these findings would generalize to other architectures.
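The random-vs-intact embedding control could be sketched as follows. This is a hypothetical, minimal illustration (not the original experiment): a tiny tied-weight sparse autoencoder trained on two synthetic inputs, one with low-rank structure standing in for intact embeddings and one of isotropic noise standing in for randomly re-initialized embeddings; all sizes, names, and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_sae(X, n_features=64, l1=1e-3, lr=1e-2, steps=500):
    """Tiny tied-weight SAE: H = relu(X @ W + b), X_hat = H @ W.T,
    trained on 0.5*||X_hat - X||^2 + l1*||H||_1 by plain gradient descent."""
    d = X.shape[1]
    W = rng.normal(0.0, 0.1, (d, n_features))
    b = np.zeros(n_features)
    n = len(X)
    for _ in range(steps):
        H = np.maximum(X @ W + b, 0.0)        # sparse feature activations
        X_hat = H @ W.T                        # reconstruction
        err = X_hat - X
        dH = (err @ W + l1 * np.sign(H)) * (H > 0)  # grad through ReLU
        dW = X.T @ dH + err.T @ H              # tied weights: encoder + decoder paths
        W -= lr * dW / n
        b -= lr * dH.mean(axis=0)
    return W, b

def active_fraction(X, W, b):
    """Fraction of feature activations above a small threshold."""
    H = np.maximum(X @ W + b, 0.0)
    return float((H > 1e-6).mean())

# Stand-ins for the two conditions (toy data, not real embeddings):
intact = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 32))  # low-rank structure
random_emb = rng.normal(size=(256, 32))                        # isotropic noise

for name, X in [("intact", intact), ("random", random_emb)]:
    W, b = train_sae(X)
    print(name, "active fraction:", round(active_fraction(X, W, b), 3))
```

If the transformer's structure rather than embedding superposition drives the reported findings, one would expect the feature statistics under the two conditions to look similar, which is what the note above suspects.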