https://arxiv.org/pdf/2104.07143
Depends on seed and dataset (SAE Training)
Weight Cosine Similarity
Orphan features still show high interpretability, which suggests that SAEs trained with different seeds may each have recovered a different subset of a larger “idealized dictionary”.
- seed - weight initialization matters → a careful SAE weight initialization helps mitigate this issue
- Weight cosine similarity + Hungarian Matching
The cost matrix is set to 1 minus the cosine similarity matrix between the feature weights of two SAEs, and Hungarian matching is applied to it to find the optimal 1:1 matching.
https://arxiv.org/pdf/2501.16615
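The weight cosine similarity + Hungarian matching step can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the function name `match_features`, the row-per-feature weight layout, and the toy "two seeds" data are assumptions made for the example. It relies on `scipy.optimize.linear_sum_assignment`, which solves the same assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_features(w_a, w_b):
    """Match features of two SAEs one-to-one by weight cosine similarity.

    w_a, w_b: arrays of shape (n_features, d_model), one feature direction
    per row (assumed layout; transpose first if features are columns).
    Returns (idx_a, idx_b, sims): feature idx_a[k] of SAE A is paired with
    feature idx_b[k] of SAE B, and sims[k] is their cosine similarity.
    """
    # Normalize each feature direction to unit length.
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    cos = a @ b.T                # pairwise cosine similarity matrix
    cost = 1.0 - cos             # cost matrix = 1 - cosine similarity
    idx_a, idx_b = linear_sum_assignment(cost)  # minimum-cost 1:1 matching
    return idx_a, idx_b, cos[idx_a, idx_b]


# Toy data standing in for two training seeds (hypothetical):
# the second dictionary is a shuffled, slightly noisy copy of the first.
rng = np.random.default_rng(0)
w_seed0 = rng.normal(size=(6, 16))
perm = rng.permutation(6)
w_seed1 = w_seed0[perm] + 0.01 * rng.normal(size=(6, 16))

idx_a, idx_b, sims = match_features(w_seed0, w_seed1)
```

On this toy data the matching recovers the shuffle, since `w_seed1[j]` is a noisy copy of `w_seed0[perm[j]]`. On real SAEs, matched pairs whose similarity falls below some chosen cutoff could be treated as orphan features; that cutoff is an assumption to be tuned, not a value taken from the source.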
- dataset - matters more than seed
FaithfulSAE
seonglae • Updated 2026 May 12 10:26
High-frequency features and feature-set instability are known problems in SAEs. FaithfulSAE attempts to address both by eliminating the dependency on external datasets and increasing faithfulness through self-synthetic datasets. It quantifies high-frequency features with FFR (Fake Feature Ratio) and uses the existing SFR (Shared Feature Ratio) to demonstrate that FaithfulSAE improves both metrics.
This is a rare attempt to improve SAE from a dataset perspective, highlighting the issue of external dependencies in interpretability.
https://www.arxiv.org/pdf/2506.17673
Geometry-Adaptive Explainer for Faithful Dictionary
Geometry-Adaptive Explainer for Faithful Dictionary-Based...
Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are...
https://arxiv.org/abs/2605.21849


Seonglae Cho