MoE Interpretability

Creator
Seonglae Cho
Created
2025 Oct 31 10:58
Edited
2025 Oct 31 11:02
Challenges the conventional belief that there is an inevitable "performance ↔ interpretability tradeoff".
Non-shared MLP dimensions are more monosemantic than those of a comparable dense (single-MLP) model.
A phase change does not occur during training, presumably because drastic changes in feature dimensionality do not happen. Based on this, expert specialization can be reinterpreted: rather than load balancing, it is about monosemantically representing specific features. Experts maintain high monosemanticity for the features they were initially assigned to.
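The monosemanticity claim above can be made concrete with a simple per-neuron score: the fraction of a neuron's total activation mass carried by its single strongest feature. This is a minimal illustrative sketch, not the paper's actual metric; the `monosemanticity` function, the toy activation matrices, and the feature-by-neuron layout are all assumptions made for the example.

```python
import numpy as np

def monosemanticity(acts: np.ndarray) -> np.ndarray:
    """Per-neuron monosemanticity score in [0, 1].

    acts: [num_features, num_neurons] mean (non-negative) activation of
    each neuron on inputs containing each feature. A score near 1 means
    the neuron responds to one feature only; near 1/num_features means
    it responds uniformly (polysemantic).
    """
    mass = np.clip(acts, 0.0, None)        # ignore negative activations
    total = mass.sum(axis=0) + 1e-12       # total activation mass per neuron
    return mass.max(axis=0) / total        # share of the top feature

# Toy comparison: a specialized expert neuron vs. a polysemantic one.
expert_neuron = np.array([[5.0], [0.1], [0.1]])  # fires mostly on feature 0
dense_neuron = np.array([[2.0], [1.8], [1.6]])   # fires on many features
print(monosemanticity(expert_neuron))  # near 1.0 -> monosemantic
print(monosemanticity(dense_neuron))   # near 1/3 -> polysemantic
```

Under this score, an expert whose non-shared MLP dimensions fire on narrow feature sets would come out markedly more monosemantic than a dense MLP neuron spreading its activation across many features.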