PolySAE

Creator: Seonglae Cho
Created: 2026 Feb 10 18:29
Edited: 2026 Feb 10 18:31
Refs
SAEs are linear, so features combine only through addition. This means the model cannot distinguish whether "star" + "coffee" represents a specific composition like 'Starbucks' or mere co-occurrence, so it tends to absorb complex concepts into single monolithic features.
Keep the encoder linear (preserving interpretability) while expanding only the decoder to a polynomial (2nd/3rd order) so it can model feature interactions (multiplicative combinations). Since the number of pairwise interaction terms grows quadratically with the dictionary size, the interactions are factorized at low rank in a shared low-dimensional subspace U for efficiency.
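A minimal NumPy sketch of this idea, with hypothetical dimensions and random weights standing in for trained parameters: the encoder stays a single linear map with ReLU sparsity, while the decoder adds a factorized second-order term V((Uz) ⊙ (Uz)) that captures pairwise z_i z_j interactions in a shared rank-r subspace instead of materializing all m² weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, r = 64, 256, 16  # input dim, dictionary size, interaction rank (hypothetical sizes)

# Linear encoder (kept interpretable): sparse codes via ReLU
W_enc = rng.normal(scale=0.1, size=(m, d))
b_enc = np.zeros(m)

# Decoder: linear dictionary D plus a low-rank 2nd-order term
D = rng.normal(scale=0.1, size=(d, m))
U = rng.normal(scale=0.1, size=(r, m))  # shared low-dimensional interaction subspace
V = rng.normal(scale=0.1, size=(d, r))  # maps interaction features back to input space

def encode(x):
    return np.maximum(W_enc @ x + b_enc, 0.0)  # linear map + ReLU sparsity

def decode(z):
    u = U @ z                   # project codes into the rank-r subspace
    second_order = V @ (u * u)  # factorized sum over pairwise z_i * z_j interactions
    return D @ z + second_order

x = rng.normal(size=d)
x_hat = decode(encode(x))
```

Squaring in the projected space costs O(rm + dr) per sample rather than O(dm²), which is what makes the polynomial decoder tractable.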
The learned interaction weights show almost no correlation with co-occurrence frequency (r ≈ 0.06), whereas vanilla SAE feature covariance is strongly driven by frequency (r ≈ 0.82). This suggests PolySAE's interactions are not simply replicating bigram statistics, but instead capture genuine composition such as morpheme combination, phrase composition, and context-based semantic disambiguation.
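The frequency-correlation diagnostic can be sketched as follows, using placeholder data (random interaction weights and synthetic sparse activations) in place of a trained model: compute feature co-occurrence counts over a batch of codes, then the Pearson correlation between those counts and the interaction magnitudes across all unique feature pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 256

# Placeholder stand-ins: synthetic binary activations and random interaction weights
Z = (rng.random((1000, m)) < 0.05).astype(float)  # batch of sparse codes
cooc = Z.T @ Z                                    # pairwise co-occurrence counts
W_int = rng.normal(size=(m, m))                   # learned interaction weights (placeholder)

iu = np.triu_indices(m, k=1)                      # unique feature pairs only
r = np.corrcoef(np.abs(W_int[iu]), cooc[iu])[0, 1]  # Pearson r across pairs
```

With a trained model, an r near 0 (as reported for PolySAE) indicates the interaction weights encode something beyond raw co-occurrence statistics.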
arxiv.org
