Match Features Across Layers
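- In most cases, features in different layers are the same features, only reordered or slightly modified.
- The closer two layers are, the more identical features are preserved, and a large share of them are merely reordered.
- As the layers get farther apart, however, the MSE rises and genuinely new features appear.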
Methods
- Activation Correlating - high cost
- Weight Cosine Similarity - low cost
Mechanistic Permutability
Decoder Weight MSE
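A minimal sketch of how a pairwise decoder-weight MSE could be computed between two SAEs. It assumes each decoder is available as an array with one row per feature; the function name, shapes, and toy data are illustrative assumptions rather than anything specified in these notes.

```python
import numpy as np

def pairwise_decoder_mse(dec_a: np.ndarray, dec_b: np.ndarray) -> np.ndarray:
    """Return an (n_a, n_b) matrix of MSE values between decoder vectors.

    dec_a, dec_b: arrays of shape (n_features, d_model), one row per feature.
    """
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, then average over d_model.
    sq_a = (dec_a ** 2).sum(axis=1)[:, None]
    sq_b = (dec_b ** 2).sum(axis=1)[None, :]
    sq_dist = sq_a + sq_b - 2.0 * dec_a @ dec_b.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.maximum(sq_dist, 0.0) / dec_a.shape[1]

# Toy data: "layer B" is a reordered, slightly perturbed copy of "layer A".
rng = np.random.default_rng(0)
dec_layer_a = rng.normal(size=(6, 16))
perm = rng.permutation(6)
dec_layer_b = dec_layer_a[perm] + 0.01 * rng.normal(size=(6, 16))

mse = pairwise_decoder_mse(dec_layer_a, dec_layer_b)
# Per-feature nearest neighbour; this is a lookup, not a 1:1 assignment.
nearest = mse.argmin(axis=1)
```

In this toy case every feature in A has a near-zero MSE to exactly one feature in B, which is the "only reordered" situation. For distant layers one would expect the minimum MSE per feature to grow.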
Evolution through layers
Activation Correlating
- Features contain more specialized information as they progress to higher layers.
- Features emerge as logical combinations (AND/OR).
- Some features are simply passed through layers (inefficiency)
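The pass-through and AND cases above can be illustrated with a small activation-correlation sketch. It assumes feature activations for both layers were recorded on the same tokens in the same order, which is what makes this approach costly. The function name and the toy binary activations are hypothetical.

```python
import numpy as np

def activation_correlation(acts_a: np.ndarray, acts_b: np.ndarray,
                           eps: float = 1e-8) -> np.ndarray:
    """Pearson correlation between every feature in A and every feature in B.

    acts_a: (n_tokens, n_a), acts_b: (n_tokens, n_b), aligned token by token.
    """
    # Standardize each feature, then average the products of z-scores.
    za = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + eps)
    zb = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + eps)
    return za.T @ zb / acts_a.shape[0]

# Toy data: three independent binary features in the lower layer.
rng = np.random.default_rng(0)
lower = (rng.random((2000, 3)) < 0.5).astype(float)
upper = np.stack([
    lower[:, 0],                # feature passed through unchanged
    lower[:, 1] * lower[:, 2],  # AND of two lower-layer features
], axis=1)

corr = activation_correlation(lower, upper)  # shape (3, 2)
```

Here the passed-through feature correlates almost perfectly with its source, while the AND feature correlates only partially with each of its two inputs and hardly at all with the unrelated one.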
SAE feature flow along the layers
Weight Cosine Similarity
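As a rough sketch of the weight-based approach, a feature can be followed to the next layer by picking the decoder vector with the highest cosine similarity. This assumes both SAEs decode into the same `d_model`-dimensional space; the function names and toy data are hypothetical.

```python
import numpy as np

def decoder_cosine_similarity(dec_a: np.ndarray, dec_b: np.ndarray,
                              eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between all decoder rows of A and all rows of B."""
    a = dec_a / (np.linalg.norm(dec_a, axis=1, keepdims=True) + eps)
    b = dec_b / (np.linalg.norm(dec_b, axis=1, keepdims=True) + eps)
    return a @ b.T

def best_successor(dec_prev: np.ndarray, dec_next: np.ndarray):
    """For each feature in the previous layer, return the index and
    similarity of its most similar feature in the next layer."""
    sim = decoder_cosine_similarity(dec_prev, dec_next)
    idx = sim.argmax(axis=1)
    return idx, sim[np.arange(len(idx)), idx]

# Toy data: the next layer holds a shuffled, rescaled copy of the features.
rng = np.random.default_rng(0)
dec_prev = rng.normal(size=(5, 32))
perm = rng.permutation(5)
dec_next = 3.0 * dec_prev[perm]  # cosine similarity ignores the scale

succ_idx, succ_sim = best_successor(dec_prev, dec_next)
```

No forward passes are needed, only the decoder matrices, which is why this is the low-cost option. Note that `argmax` allows several features to share one successor, so it does not produce a 1:1 matching.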
Depends on seed and dataset (SAE Training)
Weight Cosine Similarity
Orphan features still show high interpretability, which suggests that different seeds may each have found a subset of the “idealized dictionary size”.
- seed - weight initialization matters → SAE weight initialization helps to prevent this issue
- Weight cosine similarity + Hungarian Matching
The cost matrix is set to 1 minus the cosine similarity matrix, and Hungarian matching is applied to it to find the optimal 1:1 matching.
- dataset - matters more than seed
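The "weight cosine similarity + Hungarian Matching" step can be sketched with SciPy's `linear_sum_assignment`, which solves the same assignment problem. The setup below (two toy decoders standing in for two seeds, and a 0.7 threshold for flagging orphans) is an illustrative assumption, not a value from these notes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(dec_a: np.ndarray, dec_b: np.ndarray, eps: float = 1e-8):
    """Optimal 1:1 matching of decoder rows under cost = 1 - cosine similarity."""
    a = dec_a / (np.linalg.norm(dec_a, axis=1, keepdims=True) + eps)
    b = dec_b / (np.linalg.norm(dec_b, axis=1, keepdims=True) + eps)
    sim = a @ b.T
    cost = 1.0 - sim
    rows, cols = linear_sum_assignment(cost)  # minimizes the total cost
    return rows, cols, sim[rows, cols]

# Toy data: both "seeds" learn 4 shared features plus 2 features of their own.
rng = np.random.default_rng(0)
shared = rng.normal(size=(4, 64))
perm = rng.permutation(4)
dec_seed_a = np.concatenate([shared, rng.normal(size=(2, 64))])
dec_seed_b = np.concatenate([
    shared[perm] + 0.01 * rng.normal(size=(4, 64)),  # same features, reordered
    rng.normal(size=(2, 64)),                        # features unique to seed B
])

rows, cols, sims = hungarian_match(dec_seed_a, dec_seed_b)
ORPHAN_THRESHOLD = 0.7  # hypothetical cutoff chosen for this toy example
orphans_a = rows[sims < ORPHAN_THRESHOLD]  # features in A with no good partner
```

Because the matching is forced to be 1:1, features without a real counterpart still receive a partner, only with low similarity. Thresholding the matched similarity is one simple way to separate those orphans from genuinely shared features.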