SAEBench

Creator: Seonglae Cho
Created: 2024 Dec 18 16:09
Edited: 2025 Jan 20 23:41

Metrics

  • Feature Absorption (lower is better)
  • Spurious Correlation Removal (SCR) (higher is better)
  • Targeted Probe Perturbation (TPP)
  • Automated Interpretability
  • Sparse Probing
  • Reconstruction Error (L2 loss; lower is better)
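For reference, the reconstruction error metric is the L2 distance between a model activation and its SAE reconstruction. A minimal numpy sketch of a plain ReLU SAE forward pass (function and weight names are illustrative, not SAEBench's API):

```python
import numpy as np

def sae_reconstruction_error(x, W_enc, b_enc, W_dec, b_dec):
    # Encode: linear map into the (usually overcomplete) dictionary, then ReLU.
    f = np.maximum(x @ W_enc + b_enc, 0.0)
    # Decode: linear reconstruction of the original activation.
    x_hat = f @ W_dec + b_dec
    # L2 loss: mean squared reconstruction distance over the batch.
    return np.mean(np.sum((x - x_hat) ** 2, axis=-1))
```

A perfect identity encoder/decoder on non-negative inputs drives this loss to zero, which is a quick sanity check when wiring up an evaluation.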

Insights

  • Selective SAEs (TopK) and gated SAEs (JumpReLU) perform better than the regular ReLU SAE, but often show higher Feature Absorption.
  • A small dictionary size improves interpretability, while a large dictionary size reduces reconstruction error.
  • Low sparsity is suitable for interpretability, whereas high sparsity is more effective for TPP.
  • TopK SAEs are sample-efficient, but longer training may increase Feature Absorption.
Overall, if interpretability is the priority, a TopK SAE with low sparsity is a good choice. If you need to capture high-level context for complex tasks, a JumpReLU SAE with a wide dictionary size and high sparsity may be the better option.
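The TopK and JumpReLU SAEs discussed above differ mainly in how they sparsify the encoder pre-activations: TopK keeps a fixed number of features per sample, while JumpReLU passes only features above a learned per-feature threshold. A minimal numpy sketch of the two activation rules (illustrative only, not the benchmark's implementation):

```python
import numpy as np

def topk_activation(pre, k):
    # Keep only the k largest pre-activations per sample; zero the rest.
    out = np.zeros_like(pre)
    idx = np.argsort(pre, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return np.maximum(out, 0.0)

def jumprelu_activation(pre, theta):
    # Pass a pre-activation only if it exceeds the (learned) threshold theta.
    return np.where(pre > theta, pre, 0.0)
```

TopK guarantees exactly k (or fewer, after the ReLU) active features per sample, whereas JumpReLU's sparsity varies with the input, which matches the sample-efficiency versus flexibility trade-off noted above.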
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders - Dec 2024
Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda *equal contribution

Explorer

SAE Bench - Evals

Results

adamkarvonen/new_sae_bench_results at main

Recommendations