Metrics
- Feature Absorption: lower is better
- Spurious Correlation Removal (SCR): higher is better
- Targeted Probe Perturbation (TPP)
- Automated Interpretability
- Sparse Probing
- Reconstruction Error (L2 Loss)
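As a rough illustration of the last metric, reconstruction error can be computed as the mean squared L2 distance between the SAE's inputs and its reconstructions. The sketch below uses plain Python with no dependencies. The function names and toy values are hypothetical, and the L0 count is only one common way to measure sparsity; these notes do not say which measure they rely on.

```python
def l2_reconstruction_error(x, x_hat):
    """Mean over samples of the squared L2 distance between x and x_hat."""
    total = 0.0
    for row, row_hat in zip(x, x_hat):
        total += sum((a - b) ** 2 for a, b in zip(row, row_hat))
    return total / len(x)


def mean_l0(latents):
    """Average number of non-zero latents per sample (assumed sparsity measure)."""
    return sum(sum(1 for z in row if z != 0.0) for row in latents) / len(latents)


# Toy data: two samples, two input dimensions, four dictionary latents.
x = [[1.0, 2.0], [0.0, -1.0]]
x_hat = [[1.0, 1.0], [0.0, 1.0]]
latents = [[0.0, 0.5, 0.0, 2.0], [1.5, 0.0, 0.0, 0.0]]

recon = l2_reconstruction_error(x, x_hat)  # (1 + 4) / 2 = 2.5
l0 = mean_l0(latents)                      # (2 + 1) / 2 = 1.5
print(recon, l0)
```

A lower `recon` means the dictionary reproduces the activations more faithfully, and a lower `l0` means fewer latents fire per sample.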
Insights
- The selective SAE (TopK) and the gated SAE (JumpReLU) perform better than the regular ReLU SAE, but often show higher Feature Absorption.
- A small dictionary size improves interpretability, while a large dictionary size reduces reconstruction error.
- Low sparsity is suitable for interpretability, whereas high sparsity is more effective for TPP.
- The TopK SAE is sample-efficient, but longer training may increase Feature Absorption.
Overall, if interpretability is important, the TopK SAE with low sparsity is a good choice. If you need to capture high-level context for complex tasks, the JumpReLU SAE with a large dictionary size and high sparsity may be the better option.
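To make the three variants compared above concrete, here is a minimal sketch of how their activation functions differ when applied to the same encoder pre-activations. It is a simplification under stated assumptions: the function names, the toy vector, and the threshold values are hypothetical, a ReLU is applied to the TopK survivors (implementations differ on this), and real SAEs learn the JumpReLU thresholds during training rather than fixing them.

```python
def relu(pre):
    """Standard ReLU SAE: every positive pre-activation stays active."""
    return [max(0.0, v) for v in pre]


def topk(pre, k):
    """TopK SAE: keep only the k largest pre-activations, zero the rest."""
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(order[:k])
    return [max(0.0, v) if i in keep else 0.0 for i, v in enumerate(pre)]


def jumprelu(pre, theta):
    """JumpReLU SAE: pass a value unchanged only if it exceeds its threshold."""
    return [v if v > t else 0.0 for v, t in zip(pre, theta)]


# Hypothetical pre-activations for one input over five dictionary latents.
pre = [0.3, -1.0, 2.0, 0.05, 1.2]

relu_out = relu(pre)                        # four latents remain active
topk_out = topk(pre, k=2)                   # exactly two latents remain active
jump_out = jumprelu(pre, theta=[0.5] * 5)   # only latents above 0.5 remain

print(relu_out, topk_out, jump_out)
```

In this sketch TopK fixes the number of active latents per input through `k`, while JumpReLU lets that number vary with the input through the per-latent thresholds in `theta`.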