SAEBench

Creator: Seonglae Cho
Created: 2024 Dec 18 16:09
Edited: 2025 Jan 20 23:41

Metrics

  • Feature Absorption (lower is better)
  • Spurious Correlation Removal (SCR) (higher is better)
  • Targeted Probe Perturbation (TPP)
  • Automated Interpretability
  • Sparse Probing
  • Reconstruction Error (L2 loss; lower is better)
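For reference, the reconstruction error metric is the L2 distance between a model activation and its SAE reconstruction. A minimal numpy sketch of a plain ReLU SAE forward pass (function and weight names are illustrative, not SAEBench's API):

```python
import numpy as np

def sae_reconstruction_error(x, W_enc, b_enc, W_dec, b_dec):
    # Encode: linear map into the (usually overcomplete) dictionary, then ReLU.
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    # Decode: linear reconstruction of the original activation.
    x_hat = f @ W_dec + b_dec
    # L2 loss: mean squared reconstruction distance over the batch.
    return np.mean(np.sum((x - x_hat) ** 2, axis=-1))
```

A perfect identity encoder/decoder on non-negative inputs drives this loss to zero, which is a quick sanity check when wiring up an evaluation.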

Insights

  • Selective SAEs (TopK) and gated SAEs (JumpReLU) perform better than the regular ReLU SAE, but often show higher Feature Absorption.
  • A small dictionary size improves interpretability, while a large dictionary size reduces reconstruction error.
  • Low sparsity is suitable for interpretability, whereas high sparsity is more effective for TPP.
  • TopK SAEs are sample-efficient, but longer training may increase Feature Absorption.
Overall, if interpretability is the priority, a TopK SAE with low sparsity is a good choice. If you need to capture high-level context for complex tasks, a JumpReLU SAE with a wide dictionary size and high sparsity may be the better option.
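The TopK and JumpReLU SAEs discussed above differ mainly in how they sparsify the encoder pre-activations: TopK keeps a fixed number of features per sample, while JumpReLU passes only features above a learned per-feature threshold. A minimal numpy sketch of the two activation rules (illustrative only, not the benchmark's implementation):

```python
import numpy as np

def topk_activation(pre, k):
    # Keep only the k largest pre-activations per sample; zero the rest.
    out = np.zeros_like(pre)
    idx = np.argsort(pre, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return np.maximum(out, 0.0)

def jumprelu_activation(pre, theta):
    # Pass a pre-activation only if it exceeds the (learned) threshold theta.
    return np.where(pre > theta, pre, 0.0)
```

TopK guarantees exactly k (or fewer, after the ReLU) active features per sample, whereas JumpReLU's sparsity varies with the input, which matches the sample-efficiency versus flexibility trade-off noted above.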
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders - Dec 2024
Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda *equal contribution

Explorer

SAE Bench - Evals

Results

adamkarvonen/new_sae_bench_results at main

Recommendations