SAE probing
Threshold-binarization logistic regression
Uses the entire SAE dictionary as the logistic regression's input, not just a single feature (see the sketch after this list)
- Wider SAEs show worse performance overall (contradicting the previous research discussed below)
- Threshold binarization reduces compute cost, but also reduces performance
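A minimal sketch of the full-dictionary, threshold-binarized probe (assumptions: SAE activations are already extracted; the threshold value, dictionary size, and the `binarize` helper are illustrative placeholders, not the exact setup of the work summarized here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def binarize(sae_acts: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Map each SAE feature activation to {0, 1} by thresholding."""
    return (sae_acts > threshold).astype(np.float32)

# Placeholder data: (n_samples, dict_size) SAE activations and binary labels.
rng = np.random.default_rng(0)
sae_acts = rng.exponential(size=(200, 4096)) * (rng.random((200, 4096)) < 0.05)
labels = rng.integers(0, 2, size=200)

# Fit a logistic-regression probe on the entire binarized dictionary.
X = binarize(sae_acts, threshold=0.0)
probe = LogisticRegression(max_iter=1000)
probe.fit(X, labels)
print("train accuracy:", probe.score(X, labels))
```

Presumably the compute savings come from storing and processing 0/1 features instead of full-precision activations, while the discarded magnitude information explains the performance drop.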
Top-n logistic regression (only the top-n SAE features are used as input; see the sketch after this list)
- Wider SAEs show better performance
- SAE probes outperform traditional methods on small datasets (<100 samples)
- SAE probes maintain stable performance under label noise (plausibly because the SAE features are learned without labels)
- Similar performance to traditional methods under OOD and class-imbalance settings
- On datasets with spurious correlations (e.g., classes distinguished by the presence/absence of periods), the spurious feature is interpretable and can be excluded
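A sketch of the top-n probe, assuming features are ranked by the absolute difference in mean activation between the two classes; the actual selection criterion may differ, and `top_n_features` and n=16 are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def top_n_features(acts: np.ndarray, labels: np.ndarray, n: int = 16) -> np.ndarray:
    """Indices of the n features whose mean activation differs most between classes."""
    diff = np.abs(acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0))
    return np.argsort(diff)[-n:]

# Placeholder data: small-dataset regime (<100 samples) with a wide dictionary.
rng = np.random.default_rng(0)
acts = rng.exponential(size=(80, 16384)) * (rng.random((80, 16384)) < 0.02)
labels = rng.integers(0, 2, size=80)

idx = top_n_features(acts, labels, n=16)
probe = LogisticRegression(max_iter=1000).fit(acts[:, idx], labels)

# Interpretability hook: if a selected feature tracks a spurious cue
# (e.g., presence/absence of periods), remove its index from `idx` and refit.
print("selected feature indices:", idx)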
SAE probing is one of the few practical downstream applications of SAEs beyond interpretability, but it still underperforms SOTA methods.