Challenges the conventional belief that there is an inevitable "performance ↔ interpretability tradeoff".
Non-shared MLP dimensions are more monosemantic than those of a comparable dense model. No phase change occurs during training, presumably because feature dimensionality does not change drastically. This motivates reinterpreting expert specialization: rather than load balancing, it is about monosemantically representing specific features, and experts maintain high monosemanticity for the features they were initially assigned.
Intrinsically Interpretable
In MoE LLMs, certain experts (FFN experts) are strongly associated with specific behaviors (safety, deception, factuality, etc.). By detecting these experts and forcing them on or off at inference time, the model's behavior can be steered significantly without fine-tuning.
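A minimal sketch of this kind of inference-time steering, assuming standard top-k routing over router logits; the function names and the constant used to force an expert "on" are illustrative, not from any specific paper's implementation:

```python
import numpy as np

def steer_router_logits(logits, expert_ids, mode="off"):
    # logits: (num_tokens, num_experts) router scores before top-k selection.
    # Forcing an expert "off" means it can never win top-k; "on" lifts it
    # above every other score so it is always selected.
    steered = logits.copy()
    if mode == "off":
        steered[:, expert_ids] = -np.inf
    else:  # "on"
        steered[:, expert_ids] = steered.max() + 10.0  # illustrative margin
    return steered

def top_k_experts(logits, k=2):
    # Indices of the k highest-scoring experts for each token.
    return np.argsort(logits, axis=-1)[:, -k:]

# Toy router: 4 tokens, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))

routed_off = top_k_experts(steer_router_logits(logits, [3], mode="off"))
assert not (routed_off == 3).any()  # expert 3 is fully suppressed

routed_on = top_k_experts(steer_router_logits(logits, [3], mode="on"))
assert (routed_on == 3).all(axis=-1).sum() == 0  # present once per token
assert all(3 in row for row in routed_on)        # expert 3 always routed
```

In a real MoE model this edit would be applied to the gate's logits inside each targeted layer's forward pass (e.g. via a forward hook), leaving the rest of the network untouched.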
Analyzes how MoE LLMs benefit from structural sparsity, enabling experts to become more monosemantic (dedicated to single concepts)

Seonglae Cho