SAEBench

Creator
Seonglae Cho
Created
2024 Dec 18 16:9
Editor
Edited
2025 Jan 20 23:41
Refs

Metrics

  • Feature Absorption (lower is better)
  • Spurious Correlation Removal (SCR) (higher is better)
  • Targeted Probe Perturbation (TPP)
  • Automated Interpretability
  • Sparse Probing
  • Reconstruction Error (L2 Loss)
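As a rough illustration of the last metric, the sketch below computes a squared L2 reconstruction error for a single activation vector, along with an L0 count of active latents. The function names, the toy vectors, and the use of L0 as the sparsity measure are assumptions made for this example; this is not SAEBench's own code.

```python
from typing import List

Vector = List[float]


def l2_reconstruction_error(x: Vector, x_hat: Vector) -> float:
    """Squared L2 distance between an activation and its SAE reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def l0_sparsity(latents: Vector) -> int:
    """Number of non-zero latents (assumed sparsity measure for this sketch)."""
    return sum(1 for z in latents if z != 0.0)


# Toy values, chosen only for illustration.
x = [1.0, 2.0, 3.0]            # original activation
x_hat = [1.0, 2.5, 2.0]        # hypothetical SAE reconstruction
latents = [0.0, 0.7, 0.0, 1.2] # hypothetical SAE latent activations

error = l2_reconstruction_error(x, x_hat)  # 0.0 + 0.25 + 1.0 = 1.25
active = l0_sparsity(latents)              # 2 latents are non-zero
```

In practice these quantities would be averaged over many tokens; a lower error with fewer active latents indicates a better reconstruction/sparsity trade-off.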

Insights

  • Selective SAEs (Top-k) and Gated SAEs (JumpReLU) perform better than the regular ReLU SAE but often show higher Feature Absorption.
  • A small dictionary size improves interpretability, while a large dictionary size improves reconstruction (lower reconstruction error).
  • Low sparsity is suitable for interpretability, whereas high sparsity is more effective for TPP.
  • TopK SAE is sample efficient, but longer training may increase Feature Absorption.
Overall, if interpretability is important, a TopK SAE with low sparsity is a good choice. If you need to capture high-level context for complex tasks, a JumpReLU SAE with a wide dictionary and high sparsity may be the better option.
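To make the three activation types compared above concrete, here is a minimal sketch of ReLU, TopK, and JumpReLU applied to one vector of encoder pre-activations. The function names, the per-latent threshold list, and the toy numbers are assumptions for illustration; real implementations operate on batched tensors and learn the JumpReLU thresholds during training.

```python
from typing import List

Vector = List[float]


def relu(pre: Vector) -> Vector:
    """Standard ReLU: keep every positive pre-activation."""
    return [max(0.0, v) for v in pre]


def top_k(pre: Vector, k: int) -> Vector:
    """TopK: keep only the k largest pre-activations and zero the rest."""
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(order[:max(k, 0)])
    return [v if i in keep else 0.0 for i, v in enumerate(pre)]


def jump_relu(pre: Vector, thresholds: Vector) -> Vector:
    """JumpReLU: pass a value through unchanged only above its threshold."""
    return [v if v > t else 0.0 for v, t in zip(pre, thresholds)]


pre = [0.5, -1.0, 2.0, 0.1]  # toy pre-activations

relu_out = relu(pre)                             # [0.5, 0.0, 2.0, 0.1]
topk_out = top_k(pre, k=2)                       # [0.5, 0.0, 2.0, 0.0]
jump_out = jump_relu(pre, [1.0, 1.0, 1.0, 0.05]) # [0.0, 0.0, 2.0, 0.1]
```

TopK fixes the number of active latents per input, while JumpReLU lets that number vary with the input, since each latent fires whenever it clears its own threshold.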
 
 
 

Explorer

Results

 
 
