Mixture of Factor Analyzers
Instead of interpreting features as global directions (as SAEs do), MFA divides the activation space into multiple "regions + subspaces": several Gaussian regions, each with a centroid and its own low-dimensional subspace. Concepts may not be single directions, but rather clusters of nearby regions. An activation is then described by "which region (centroid)" + "how it varies within that region (subspace)".
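The "centroid + local offset" reading can be sketched in a few lines of NumPy. This is a minimal illustration of a textbook mixture of factor analyzers, not the paper's implementation: it assumes each component k has a centroid `mus[k]`, a loading matrix `Ws[k]`, and a diagonal noise variance `psi` shared across components. The function name `mfa_decompose` and all parameter names are hypothetical.

```python
import numpy as np

def mfa_decompose(x, pis, mus, Ws, psi):
    """Split one activation into (region index, responsibilities, local coords).

    Assumed shapes (illustrative, not from the paper):
      x: (d,), pis: (K,), mus: (K, d), Ws: (K, d, q), psi: (d,)
    Generative model assumed: x = mus[k] + Ws[k] @ z + noise,
    with z ~ N(0, I) and noise ~ N(0, diag(psi)).
    """
    K, d, _ = Ws.shape
    log_post = np.empty(K)
    for k in range(K):
        # Marginal covariance of component k: W W^T + Psi
        cov = Ws[k] @ Ws[k].T + np.diag(psi)
        diff = x - mus[k]
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)
        log_post[k] = np.log(pis[k]) - 0.5 * (logdet + maha + d * np.log(2 * np.pi))

    # Normalize in log space to get "which region" responsibilities
    resp = np.exp(log_post - log_post.max())
    resp /= resp.sum()
    k = int(resp.argmax())

    # "How it varies within that region": posterior mean of the latent z,
    # E[z | x, k] = W^T (W W^T + Psi)^{-1} (x - mu)
    cov = Ws[k] @ Ws[k].T + np.diag(psi)
    z = Ws[k].T @ np.linalg.solve(cov, x - mus[k])
    return k, resp, z
```

Here `k` answers "which region", while `z` gives the coordinates of the offset inside that region's subspace, so `mus[k] + Ws[k] @ z` is the local reconstruction of the activation.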
Localization / Steering Performance:
- On MIB, RAVEL, and MCQA: clearly better than SAE and PCA, and similar to supervised methods (DAS)
- MFA also performs best on average for steering
Interpretability
- SAE: Combines multiple global directions → difficult to interpret
- MFA: centroid + local offset → mostly interpretable
- Interpretability rate: MFA 96% vs SAE 29%
https://arxiv.org/pdf/2602.02464v1

Seonglae Cho