SAE Dataset

Creator: Seonglae Cho
Created: 2025 Feb 15 20:44
Edited: 2025 Jul 7 1:40
Sparse Autoencoders can learn meaningful features from relatively little data compared to foundation model pretraining.
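As a rough illustration, below is a minimal sketch (assuming PyTorch and a tensor of cached residual-stream activations; the dimensions, corpus size, and hyperparameters are placeholders, not values from any specific run) of training an SAE with an L1 sparsity penalty:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Activations -> overcomplete sparse code -> reconstruction."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activations
        return x_hat, f

# Placeholder for cached residual-stream activations from a (relatively small) corpus.
acts = torch.randn(100_000, 512)          # (n_tokens, d_model)
sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 3e-4                           # illustrative sparsity coefficient

for step in range(1_000):
    batch = acts[torch.randint(0, acts.shape[0], (1024,))]
    x_hat, f = sae(batch)
    loss = ((x_hat - batch) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```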
 
 

Dataset (The Pile, OpenWebText, TinyStories)

Depends on seed and dataset (SAE Training)

Weight Cosine Similarity
Orphan features still show high interpretability, which suggests that a different seed may have found a subset of the "idealized dictionary" (see the matching sketch below).
  • Dataset matters more than seed
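A minimal sketch of this cross-seed comparison, assuming two SAEs trained with different seeds whose decoder weight matrices have shape (d_model, d_dict); the 0.7 cosine threshold is an arbitrary illustration, not a canonical value:

```python
import torch
import torch.nn.functional as F

def match_features(dec_a: torch.Tensor, dec_b: torch.Tensor, threshold: float = 0.7):
    """dec_a, dec_b: decoder weights of shape (d_model, d_dict).
    Returns, for each feature in SAE A, its best cosine match in SAE B,
    plus a mask of 'orphan' features with no match above the threshold."""
    a = F.normalize(dec_a, dim=0)          # unit-norm decoder directions (columns)
    b = F.normalize(dec_b, dim=0)
    sims = a.T @ b                          # (d_dict_a, d_dict_b) cosine similarities
    best_sim, best_idx = sims.max(dim=1)    # best counterpart for each A-feature
    orphan_mask = best_sim < threshold
    return best_idx, best_sim, orphan_mask

# Placeholder decoder weights from two seeds
best_idx, best_sim, orphans = match_features(torch.randn(512, 4096), torch.randn(512, 4096))
print(f"shared feature ratio ≈ {1 - orphans.float().mean().item():.2f}")
```

Features in SAE A with no counterpart above the threshold are the orphan features; the complement of the orphan fraction is one natural reading of a shared-feature ratio.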
Many features have not yet been discovered in the model. Whether a feature exists in the SAE dictionary is closely linked to how frequently the corresponding concept appears in the dataset.
For example, chemical concepts that are frequently mentioned in the training data almost always have features in the dictionary, while rarely mentioned concepts often lack them.
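One way to probe whether a concept has earned a dedicated feature, sketched under assumptions (a stand-in untrained encoder and a placeholder `get_activations` helper; in practice you would load the trained SAE encoder and extract real residual-stream activations):

```python
import torch
import torch.nn as nn

# Stand-in for a trained SAE encoder (in practice, load the trained weights).
d_model, d_dict = 512, 4096
encoder = nn.Linear(d_model, d_dict)

def get_activations(prompt: str) -> torch.Tensor:
    """Placeholder: in practice, run the LM on `prompt` and return residual-stream
    activations of shape (n_tokens, d_model). Random data keeps the sketch runnable."""
    return torch.randn(16, d_model)

concept_prompts = ["Benzene is an aromatic hydrocarbon.", "The alkali metals react with water."]
baseline_prompts = ["The weather was pleasant yesterday.", "She walked to the market."]

def mean_feature_activation(prompts):
    feats = [torch.relu(encoder(get_activations(p))).mean(dim=0) for p in prompts]
    return torch.stack(feats).mean(dim=0)   # (d_dict,) average activation per feature

selectivity = mean_feature_activation(concept_prompts) - mean_feature_activation(baseline_prompts)
print("candidate concept features:", selectivity.topk(5).indices.tolist())
# If no feature stands out, the concept was likely too rare in the training data
# to earn a dedicated dictionary entry.
```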

FaithfulSAE

High-frequency features and feature-set instability are known problems in SAEs. This approach tries to address both by removing the external-dataset dependency and increasing faithfulness through self-synthetic datasets. It quantifies high-frequency features with the FFR (Fake Feature Ratio) and uses the existing SFR (Shared Feature Ratio) to show that FaithfulSAE improves both metrics.
This is a rare attempt to improve SAEs from the dataset perspective, highlighting the issue of external dependencies in interpretability.
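A rough sketch of the FFR side, under the assumption that "fake"/high-frequency features are simply features firing on more than some fraction of tokens (the 1% threshold is illustrative, not the paper's definition); SFR can be computed from the cross-seed matching sketched earlier:

```python
import torch

def fake_feature_ratio(feature_acts: torch.Tensor, freq_threshold: float = 0.01) -> float:
    """feature_acts: (n_tokens, d_dict) SAE feature activations on held-out tokens.
    A feature 'fires' on a token if its activation is > 0; FFR is taken here as the
    fraction of features firing on more than `freq_threshold` of tokens (illustrative)."""
    firing_freq = (feature_acts > 0).float().mean(dim=0)   # per-feature firing frequency
    return (firing_freq > freq_threshold).float().mean().item()

# Placeholder activations; in practice, encode a held-out token set with the trained SAE.
print(fake_feature_ratio(torch.relu(torch.randn(10_000, 4096) - 2)))
```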

Domain-specific SAE

SSAE: Instead of training on the entire dataset, it focuses on learning rare concepts (like dark matter) in specific subdomains (e.g., physics, or toxicity), identified through Dense Retrieval using seed keywords. This captures more tail concepts with fewer features than a standard SAE. The approach fine-tunes a general SAE (GSAE) with the TERM loss on subdomain data collected via Dense Retrieval (or TracIn).
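A minimal sketch of the TERM-style fine-tuning step, assuming per-sample SAE reconstruction losses and a tilt parameter t > 0 (the tilt value, L1 coefficient, and placeholder GSAE/activation tensors are illustrative; Dense Retrieval of the subdomain text is not shown):

```python
import torch
import torch.nn as nn

def term_loss(per_sample_losses: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Tilted empirical risk: (1/t) * log(mean(exp(t * l_i))).
    With t > 0, high-loss samples (rare / tail concepts) dominate the objective."""
    n = per_sample_losses.numel()
    return (torch.logsumexp(t * per_sample_losses, dim=0) - torch.log(torch.tensor(float(n)))) / t

# Placeholder GSAE and subdomain activations; in practice the activations come from
# subdomain text collected via Dense Retrieval (or TracIn) with seed keywords.
d_model, d_dict = 512, 4096
encoder, decoder = nn.Linear(d_model, d_dict), nn.Linear(d_dict, d_model)
subdomain_acts = torch.randn(50_000, d_model)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

for step in range(500):
    batch = subdomain_acts[torch.randint(0, subdomain_acts.shape[0], (1024,))]
    f = torch.relu(encoder(batch))
    recon = decoder(f)
    per_sample = ((recon - batch) ** 2).mean(dim=-1) + 3e-4 * f.abs().sum(dim=-1)
    loss = term_loss(per_sample)           # tilt the objective toward hard (tail-concept) samples
    opt.zero_grad()
    loss.backward()
    opt.step()
```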
 
 
 
 
 

Recommendations