SAE Feature for Classification

Creator
Seonglae Cho
Created
2025 Feb 19 12:45
Edited
2025 Mar 6 20:56
Refs

SAE probing

Threshold Binarization logistic regression

Instead of using a single feature, the entire SAE dictionary is fed into a logistic regression as input (a minimal sketch follows this list).
  • Wider SAEs show worse performance overall (contradicting the top-n research below)
  • Threshold binarization reduces compute cost, but it also reduces performance
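A minimal sketch of this setup, assuming SAE feature activations have already been extracted into a matrix of shape (n_samples, dict_size); the random placeholder data, the threshold of 0, and the scikit-learn probe are illustrative assumptions, not the exact configuration from the research.

```python
# Minimal sketch: binarize every SAE feature and fit a logistic regression
# over the whole dictionary. The placeholder data and threshold are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
mask = rng.random((500, 4096)) < 0.05                 # SAE activations are sparse
sae_acts = rng.exponential(1.0, (500, 4096)) * mask   # placeholder SAE activations
labels = rng.integers(0, 2, size=500)                 # placeholder binary labels

# Threshold binarization: each feature becomes active (1) or inactive (0).
threshold = 0.0
X = (sae_acts > threshold).astype(np.float32)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

# Logistic regression over the entire binarized dictionary.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("test accuracy:", probe.score(X_test, y_test))
```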

Top-n logistic regression

  • Wider SAEs show better performance
  • SAE probes outperform traditional methods on small datasets (<100 samples)
  • SAE probes maintain stable performance under label noise (because the SAE features are learned without labels)
  • Similar performance to traditional methods for OOD and class imbalance cases
  • For datasets with spurious correlations (e.g. classes distinguishable by the presence or absence of a period), the spurious feature is interpretable and can be excluded from the probe (see the sketch after this list)
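A minimal sketch of a top-n probe under the same placeholder setup; the ranking criterion (absolute mean-activation difference between classes), the choice of n, and the spurious-feature removal step are illustrative assumptions rather than the exact selection rule from the research.

```python
# Minimal sketch: rank SAE features, keep only the top-n, and probe them
# with logistic regression. The ranking rule and n are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
mask = rng.random((500, 4096)) < 0.05
sae_acts = rng.exponential(1.0, (500, 4096)) * mask   # sparse placeholder activations
labels = rng.integers(0, 2, size=500)                 # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(
    sae_acts, labels, test_size=0.2, random_state=0
)

# Rank features by how differently they activate across the two classes.
n = 16
mean_diff = np.abs(
    X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
)
top_idx = np.argsort(mean_diff)[-n:]

# A spurious but interpretable feature (e.g. one firing on sentence-final
# periods) can simply be dropped from top_idx before fitting.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train[:, top_idx], y_train)
print("test accuracy:", probe.score(X_test[:, top_idx], y_test))
```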
SAE probing is one of the few practical downstream tasks for SAEs besides interpretability, but it still underperforms compared to SOTA methods.
