MFA

Mixture of Factor Analyzers

Instead of interpreting features as global directions (as SAEs do), MFA divides the activation space into multiple "regions + subspaces". The activation space is modeled as several Gaussian regions, each with a centroid and a low-dimensional subspace within that region. Concepts may not be single directions, but rather clusters of nearby regions. Activation = "which region (centroid)" + "how it varies within that region (subspace)".
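A minimal sketch of the "centroid + local offset" decomposition, assuming the standard mixture-of-factor-analyzers form x = mu_k + W_k z + noise with diagonal noise variance. The function name `mfa_decompose` and the parameters `pis`, `mus`, `Ws`, `psi` are illustrative, not taken from the paper, and the paper's exact parameterization and fitting procedure may differ.

```python
import numpy as np

def mfa_decompose(x, pis, mus, Ws, psi):
    """Split activation x into a region index k and local coordinates z.

    Assumed (hypothetical) parameters of an already-fitted model:
      pis: mixing weights, shape (K,)
      mus: list of K centroids, each shape (d,)
      Ws:  list of K loading matrices, each shape (d, q)
      psi: shared diagonal noise variances, shape (d,)
    """
    d = x.shape[0]
    log_post = np.empty(len(pis))
    zs = []
    for k, (pi, mu, W) in enumerate(zip(pis, mus, Ws)):
        cov = W @ W.T + np.diag(psi)       # covariance of region k
        diff = x - mu
        sol = np.linalg.solve(cov, diff)   # cov^{-1} (x - mu_k)
        _, logdet = np.linalg.slogdet(cov)
        # log pi_k + log N(x; mu_k, cov)
        log_post[k] = np.log(pi) - 0.5 * (diff @ sol + logdet + d * np.log(2 * np.pi))
        zs.append(W.T @ sol)               # posterior mean of z given x and k
    resp = np.exp(log_post - log_post.max())
    resp /= resp.sum()                     # responsibilities p(k | x)
    k = int(np.argmax(resp))               # "which region"
    recon = mus[k] + Ws[k] @ zs[k]         # centroid + offset within the subspace
    return k, zs[k], recon, resp

# Toy example: two regions in a 4-dimensional space, 2-dimensional subspaces.
rng = np.random.default_rng(0)
pis = np.array([0.5, 0.5])
mus = [np.zeros(4), 10.0 * np.ones(4)]
Ws = [rng.normal(size=(4, 2)), rng.normal(size=(4, 2))]
psi = 0.1 * np.ones(4)

x = mus[1] + Ws[1] @ np.array([1.0, -0.5])  # a point generated from region 1
k, z, recon, resp = mfa_decompose(x, pis, mus, Ws, psi)
```

Here `k` answers "which region" and `z` answers "how it varies within that region"; `recon` is the activation rebuilt from the centroid plus the local offset.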

Localization / Steering Performance:

On MIB, RAVEL, MCQA:
→ Clearly better than SAE, PCA
→ Similar to supervised methods (DAS)

MFA also performs best on average for steering

Interpretability

SAE: Combines multiple global directions → difficult to interpret

MFA: centroid + local offset → mostly interpretable

Interpretability rate: MFA 96% vs SAE 29%

https://arxiv.org/pdf/2602.02464v1
