SAE Probing

Creator
Seonglae Cho
Created
2025 Feb 19 12:45
Edited
2025 Dec 18 16:23

SAE for Classification

SAE probing uses learned sparse features as inputs to downstream classifiers.

Threshold Binarization logistic regression

Rather than using a single feature, the entire SAE dictionary is used as the input to a logistic regression probe.
  • Wider SAEs show worse performance overall (contradicting the research below)
  • Threshold binarization reduces compute cost, but also reduces performance
arxiv.org
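A minimal sketch of this setup, with synthetic activations standing in for a real SAE encoder; the dictionary size, the threshold `tau`, and the toy labels are all illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical SAE feature activations: (n_samples, dictionary_size).
# In practice these come from encoding model activations with a trained SAE.
n_samples, dict_size = 200, 512
acts = np.maximum(rng.normal(size=(n_samples, dict_size)), 0)  # ReLU-like sparsity
labels = (acts[:, 7] > 0.5).astype(int)  # toy label tied to one feature

# Threshold binarization: keep only whether each feature fired above tau.
tau = 0.1
binarized = (acts > tau).astype(np.float32)

# Probe over the entire dictionary, not a single hand-picked feature.
probe = LogisticRegression(max_iter=1000).fit(binarized, labels)
print(f"train accuracy: {probe.score(binarized, labels):.2f}")
```

Binarization turns each sample into a cheap bit-vector of "which features fired", which is where the compute savings come from.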

Top-n logistic regression

  • Wider SAEs show better performance
  • SAE probes outperform traditional methods on small datasets (<100 samples)
  • SAE probes maintain stable performance under label noise (since the features are learned unsupervised)
  • Similar performance to traditional methods under OOD shift and class imbalance
  • On datasets with spurious correlations (e.g., classes distinguishable by the presence or absence of periods), SAE features are interpretable, so the spurious feature can be identified and excluded
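A rough sketch of a top-n probe; ranking features by the class-mean activation difference is one simple selection heuristic assumed here for illustration, and the sizes and labels are synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical SAE activations (n_samples, dictionary_size);
# real ones come from an SAE encoder over model activations.
n, d = 120, 1024  # small dataset, wide dictionary
acts = np.maximum(rng.normal(size=(n, d)), 0)
y = (acts[:, 3] + acts[:, 42] > 1.0).astype(int)  # toy label from two features

# Rank features by mean activation difference between classes, keep the top n.
top_n = 16
diff = np.abs(acts[y == 1].mean(axis=0) - acts[y == 0].mean(axis=0))
top_idx = np.argsort(diff)[-top_n:]

# Fit the probe on only the selected features.
probe = LogisticRegression(max_iter=1000).fit(acts[:, top_idx], y)
print(f"train accuracy: {probe.score(acts[:, top_idx], y):.2f}")
```

Restricting the probe to a handful of features keeps it sample-efficient, which is consistent with the small-dataset advantage noted above.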
SAE Probing: What is it good for? Absolutely something! — LessWrong
SAE probing is one of the few practical downstream tasks for SAEs besides interpretability, but it still underperforms compared to SOTA methods.
Takeaways From Our Recent Work on SAE Probing — LessWrong
arxiv.org

DeepMind demonstrated that linear probing performs better

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) — LessWrong
Linear probes generally assign positive labels only at specific tokens, but in practice features propagate to other tokens through attention. Probes therefore train on effectively mislabeled data, which degrades their feature-prediction performance. SAE features, which activate in a distributed way and track the model's internal state rather than the surface text, may thus have a comparative advantage for probing.
Adam Karvonen on Twitter / X
I'm skeptical that probes would be as good. For example, consider a middle layer SAE gender feature, where ablating it significantly reduces model gender bias. It will activate on e.g. gender pronouns, but also on many other seemingly unrelated tokens. Presumably this is because…— Adam Karvonen (@a_karvonen) May 17, 2025

Max pooling over token positions usually works better, since SAE activations are sparse and monosemantic
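The intuition can be checked numerically: when a sparse feature fires strongly on a single token, max pooling preserves the full signal while mean pooling dilutes it. Shapes and values below are illustrative, not from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-token SAE activations for one prompt: (seq_len, dict_size).
seq_len, d = 32, 256
token_acts = np.maximum(rng.normal(size=(seq_len, d)), 0)

# Make feature 10 fire strongly on exactly one token.
token_acts[:, 10] = 0.0
token_acts[5, 10] = 3.0

# Pool across the sequence dimension to get one vector per prompt.
mean_pooled = token_acts.mean(axis=0)  # feature 10 -> 3.0 / 32, heavily diluted
max_pooled = token_acts.max(axis=0)    # feature 10 -> 3.0, fully preserved

print(f"mean-pooled feature 10: {mean_pooled[10]:.3f}")
print(f"max-pooled feature 10:  {max_pooled[10]:.3f}")
```

For a monosemantic feature, "did it fire anywhere in the prompt" is usually the signal the probe needs, and max pooling answers exactly that.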

arxiv.org

Recommendations