Spurious Interpretability

Creator
Seonglae Cho
Created
2025 Jun 28 22:17
Edited
2025 Jun 28 22:18

Depends on seed and dataset (SAE Training)

Weight Cosine Similarity
Orphan features still show high interpretability, which indicates that each seed may have found a different subset of the “idealized dictionary size”.
  • The dataset matters more than the seed
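The comparison above can be sketched in a few lines. This is a minimal illustration, not the method of any particular paper: it assumes each SAE's decoder is available as a NumPy array of shape (features, hidden size), uses a simple one-directional max-cosine match rather than an optimal one-to-one assignment, and the 0.7 threshold and all function names are hypothetical choices.

```python
import numpy as np

def max_cosine_matches(decoder_a, decoder_b):
    """For each feature (row) of decoder_a, return its highest cosine
    similarity to any feature (row) of decoder_b."""
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

def split_shared_and_orphan(decoder_a, decoder_b, threshold=0.7):
    """Split the features of decoder_a into shared (matched in decoder_b)
    and orphan (unmatched). The threshold is an assumed value."""
    best = max_cosine_matches(decoder_a, decoder_b)
    shared = np.flatnonzero(best >= threshold)
    orphan = np.flatnonzero(best < threshold)
    return shared, orphan

# Toy decoders standing in for two SAEs trained with different seeds.
sae_a = np.eye(6, 16)               # six one-hot feature directions
sae_b = np.zeros((6, 16))
sae_b[:4] = sae_a[:4]               # four directions are (almost) shared
sae_b[:4, 10] += 0.05               # small perturbation of the shared ones
sae_b[4, 8] = 1.0                   # two directions unique to sae_b
sae_b[5, 9] = 1.0

shared, orphan = split_shared_and_orphan(sae_a, sae_b)
shared_feature_ratio = len(shared) / sae_a.shape[0]
print(shared, orphan, shared_feature_ratio)
```

In this toy setup, features 4 and 5 of `sae_a` have no counterpart in `sae_b` and come out as orphans. Whether such orphans are interpretable has to be checked separately, for example with an auto-interpretability score.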

FaithfulSAE
seonglae, updated 2025 Aug 21 15:27

High-frequency features and feature-set instability are known problems in SAEs. FaithfulSAE attempts to address both by eliminating the dependency on an external dataset and increasing faithfulness through self-synthetic datasets, that is, data generated by the model itself. It quantifies high-frequency features with the FFR (Fake Feature Ratio) and uses the existing SFR (Shared Feature Ratio) to demonstrate that FaithfulSAE improves both metrics.
This is a rare attempt to improve SAEs from the dataset side, and it highlights the issue of external dependencies in interpretability.
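As a rough sketch of what a frequency-based metric like FFR measures: assume a matrix of SAE feature activations of shape (tokens, features), and count the share of features that fire on more than some fraction of tokens. The 0.1 cutoff and the function names below are assumptions for illustration only; the exact definition should be taken from the FaithfulSAE paper.

```python
import numpy as np

def firing_frequency(activations):
    """Fraction of tokens on which each feature is active (activation > 0)."""
    return (activations > 0).mean(axis=0)

def fake_feature_ratio(activations, freq_threshold=0.1):
    """Share of features whose firing frequency exceeds freq_threshold.
    The threshold is an assumed value, not taken from the source."""
    freq = firing_frequency(activations)
    return float((freq > freq_threshold).mean())

# Toy activations: 20 tokens, 4 features.
acts = np.zeros((20, 4))
acts[:, 0] = 1.0     # fires on every token (high-frequency)
acts[0, 1] = 0.5     # fires on 1 of 20 tokens
acts[:6, 2] = 2.0    # fires on 6 of 20 tokens (high-frequency)
                     # feature 3 never fires

ffr = fake_feature_ratio(acts)
print(firing_frequency(acts), ffr)
```

A lower value means fewer features that fire almost everywhere. SFR can be estimated with the same kind of cross-seed cosine matching described under Weight Cosine Similarity.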
 
 
 

Backlinks

SAE Training
