SAE Feature Matching

Creator: Seonglae Cho
Created: 2024 Oct 24 11:30
Edited: 2025 Feb 24 1:14

Match Features Across Layers

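  • In most cases, features in different layers are merely reordered or only slightly transformed.
  • The closer two layers are, the more identical features are preserved, and the share that is simply permuted tends to be high.
  • As the layers get farther apart, however, the MSE rises and genuinely new features appear.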

Methods

  • Activation Correlating - high cost
  • Weight Cosine Similarity - low cost
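
The low-cost option can be sketched as follows: a minimal, dependency-free example that matches each feature of one SAE to the most similar decoder direction of another by cosine similarity. The function names (`cosine`, `match_by_weight_cosine`) and the toy decoder rows are hypothetical, and one row per feature is an assumed layout, not something this note specifies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_by_weight_cosine(dec_a, dec_b):
    """For each feature row in dec_a, return (best index in dec_b, similarity)."""
    matches = []
    for u in dec_a:
        sims = [cosine(u, v) for v in dec_b]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((best, sims[best]))
    return matches

# Toy decoder matrices, one row per feature (hypothetical values).
layer_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer_b = [[0.0, 0.9, 0.1], [1.0, 0.1, 0.0]]
matches = match_by_weight_cosine(layer_a, layer_b)
```

No forward pass is needed here, only the trained weights, which is why this approach is cheap compared with correlating activations. Note that this greedy argmax does not enforce a one-to-one matching.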

Mechanistic Permutability

Decoder Weight MSE
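
As a rough illustration of matching by decoder weight MSE, the sketch below searches for the one-to-one assignment of features between two layers that minimizes the mean squared error between paired decoder rows. It assumes both SAEs have the same number of features and uses brute force over permutations, which is only feasible for tiny toy dictionaries; a realistic implementation would swap in a linear assignment solver. All names and values are hypothetical.

```python
from itertools import permutations

def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def best_permutation(dec_a, dec_b):
    """Return (perm, cost) where perm[i] is the dec_b row paired with dec_a row i."""
    n = len(dec_a)
    best_perm, best_cost = None, float("inf")
    # Exhaustive search: O(n!) and only meant for illustration.
    for perm in permutations(range(n)):
        cost = sum(mse(dec_a[i], dec_b[j]) for i, j in enumerate(perm)) / n
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Toy decoders: layer_b is a shuffled, slightly perturbed copy of layer_a.
layer_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_b = [[0.1, 1.0], [0.9, 1.1], [1.0, 0.1]]
perm, cost = best_permutation(layer_a, layer_b)
```

A low cost after the best permutation means the two layers hold nearly the same features in a different order; a higher cost points to features that have changed or newly appeared.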

Evolution through layers

Activation Correlating
  • Features contain more specialized information as they progress to higher layers.
  • New features emerge as logical combinations (AND/OR) of existing features.
  • Some features are simply passed through layers unchanged, which is an inefficiency.
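
Activation correlating can be sketched as a Pearson correlation between feature activations recorded on the same tokens at two layers. The example below is a minimal, dependency-free illustration; the function names and the toy activation values are hypothetical, and in practice the activations come from running the model and both SAEs over a dataset, which is where the high cost comes from.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length activation sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlate_features(acts_a, acts_b):
    """acts_a[i] holds feature i's activations over tokens; returns a correlation matrix."""
    return [[pearson(fa, fb) for fb in acts_b] for fa in acts_a]

# Toy activations over four tokens (hypothetical values).
acts_lower = [[0.0, 1.0, 0.0, 2.0], [1.0, 0.0, 1.0, 0.0]]
acts_upper = [[0.9, 0.0, 1.1, 0.1], [0.0, 2.0, 0.0, 4.0]]
corr = correlate_features(acts_lower, acts_upper)
```

Here lower-layer feature 0 lines up with upper-layer feature 1 and vice versa, the kind of pass-through pairing described in the list above. A feature that correlates only moderately with several earlier features would be a candidate for a combination rather than a pass-through.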

SAE feature flow along the layers

Weight Cosine Similarity

Depends on seed and dataset (SAE Training)

Orphan features (those with no match across seeds) still show high interpretability, which suggests that each seed may have found a different subset of the features available at the “idealized dictionary size”.
  • Dataset matters more than seed