MFA

Creator
Seonglae Cho
Created
2026 Feb 10 18:14
Edited
2026 Feb 10 18:24

Mixture of Factor Analyzers

Instead of interpreting features as global directions (as a sparse autoencoder, SAE, does), MFA divides the activation space into multiple regions, each with its own local subspace. The activation space is modeled as several Gaussian regions, each with a centroid and a low-dimensional subspace. Under this view, a concept may not be a single direction but a cluster of nearby regions. An activation is then described by two things: which region (centroid) it falls in, and how it varies within that region (its position in the local subspace).
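The "region + subspace" decomposition can be sketched in a few lines of NumPy. This is a minimal illustration, not the method from the source: the function name `mfa_decompose`, the array shapes, and the hard nearest-centroid assignment are all assumptions made for this example (a fitted mixture of factor analyzers would normally use posterior responsibilities and per-component noise terms).

```python
import numpy as np

def mfa_decompose(x, centroids, loadings):
    """Split an activation into (region index, local coordinates, residual).

    centroids: (K, D) array, one centroid per region.
    loadings:  (K, D, R) array, one low-rank basis per region, R << D.
    Assumption: the region is picked by nearest centroid, which is a
    simplification of the soft assignment a full MFA would use.
    """
    # "Which region": index of the closest centroid.
    k = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    # "How it varies within that region": least-squares coordinates
    # of the offset in that region's subspace.
    offset = x - centroids[k]
    z, *_ = np.linalg.lstsq(loadings[k], offset, rcond=None)
    # Whatever the local subspace cannot explain.
    residual = offset - loadings[k] @ z
    return k, z, residual

# Toy data (hypothetical): 3 well-separated regions in 8 dimensions,
# each with a 2-dimensional local subspace.
rng = np.random.default_rng(0)
K, D, R = 3, 8, 2
centroids = 100.0 * np.eye(K, D)
loadings = rng.normal(size=(K, D, R))

z_true = np.array([0.5, -1.0])
x = centroids[1] + loadings[1] @ z_true

k, z, residual = mfa_decompose(x, centroids, loadings)
print(k, z, np.linalg.norm(residual))
```

In this toy setup the activation is built from region 1, so the decomposition should return that region and recover the local coordinates, with a residual near zero.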
Localization / Steering Performance
  • On MIB, RAVEL, and MCQA, MFA is clearly better than SAE and PCA, and similar to supervised methods (DAS)
  • MFA also performs best on average for steering
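One way the centroid view could be used for steering is sketched below. The source does not describe the steering procedure, so this is purely an assumption for illustration: the helper `steer_to_region` and the idea of swapping the centroid while keeping the local offset are hypothetical, not the authors' method.

```python
import numpy as np

def steer_to_region(x, centroids, src, dst, alpha=1.0):
    """Shift x from region `src` toward region `dst`.

    Assumption: steering is modeled as adding the difference between
    two centroids, so the local offset within the region is preserved.
    alpha=1.0 is a full move; smaller values move part of the way.
    """
    return x + alpha * (centroids[dst] - centroids[src])

# Toy example (hypothetical values): two regions in 3 dimensions.
centroids = np.array([[0.0, 0.0, 0.0],
                      [4.0, 0.0, 0.0]])
local_offset = np.array([0.1, -0.2, 0.3])
x = centroids[0] + local_offset

steered = steer_to_region(x, centroids, src=0, dst=1)
print(steered - centroids[1])  # same local offset, new region
```

The design choice here is that only the "which region" part of the activation changes, while the "how it varies within that region" part is carried over unchanged.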
Interpretability
  • SAE: combines multiple global directions → difficult to interpret
  • MFA: centroid + local offset → mostly interpretable
  • Interpretability rate: MFA 96% vs SAE 29%
 
 
 
arxiv.org
 

 
