Challenges the conventional belief that there is an inevitable "performance ↔ interpretability tradeoff".
Non-shared MLP dimensions are more monosemantic than those of a comparable dense model. No phase change occurs during training, presumably because feature dimensionality does not change drastically. This motivates reinterpreting expert specialization: rather than load balancing, it is about monosemantically representing specific features, and experts maintain high monosemanticity for the features they were initially assigned.
Intrinsically Interpretable
In MoE LLMs, certain experts (FFN experts) are strongly associated with specific behaviors (safety, deception, factuality, etc.). By detecting these experts and forcing them on or off at inference time, the model's behavior can be steered significantly without fine-tuning.
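A minimal sketch of this kind of inference-time steering, assuming standard top-k routing over router logits; the function names and the constant used to force an expert "on" are illustrative, not from any specific paper's implementation:

```python
import numpy as np

def steer_router_logits(logits, expert_ids, mode="off"):
    # logits: (num_tokens, num_experts) router scores before top-k selection.
    # Forcing an expert "off" means it can never win top-k; "on" lifts it
    # above every other score so it is always selected.
    steered = logits.copy()
    if mode == "off":
        steered[:, expert_ids] = -np.inf
    else:  # "on"
        steered[:, expert_ids] = steered.max() + 10.0  # illustrative margin
    return steered

def top_k_experts(logits, k=2):
    # Indices of the k highest-scoring experts for each token.
    return np.argsort(logits, axis=-1)[:, -k:]

# Toy router: 4 tokens, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))

routed_off = top_k_experts(steer_router_logits(logits, [3], mode="off"))
assert not (routed_off == 3).any()  # expert 3 is fully suppressed

routed_on = top_k_experts(steer_router_logits(logits, [3], mode="on"))
assert (routed_on == 3).all(axis=-1).sum() == 0  # present once per token
assert all(3 in row for row in routed_on)        # expert 3 always routed
```

In a real MoE model this edit would be applied to the gate's logits inside each targeted layer's forward pass (e.g. via a forward hook), leaving the rest of the network untouched.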
Analyzes how MoE LLMs benefit from structural sparsity, enabling experts to become more monosemantic (dedicated to single concepts)

Seonglae Cho