In MoE training, the router only receives gradients from the experts it activates, so it never gets a dense learning signal. Default MoE solves this by replacing each inactive expert's output with an EMA (exponential moving average) of its past outputs (a default vector), which gives the router dense gradients while keeping compute effectively constant, since the default vectors are cached rather than recomputed. Reported effects: faster convergence, improved perplexity and benchmark scores, more stable training, and minimal memory/speed overhead.
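A minimal PyTorch sketch of the idea, assuming one default vector per expert updated as an EMA of that expert's batch-mean output; the class and parameter names (`DefaultMoE`, `ema_decay`, etc.) are illustrative, not taken from a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DefaultMoE(nn.Module):
    """Sketch: inactive experts contribute a cached EMA 'default vector', so the
    router's softmax receives gradients for every expert while only the top-k
    experts are actually computed."""
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2, ema_decay: float = 0.99):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k
        self.ema_decay = ema_decay
        # One default vector per expert: an EMA of that expert's recent outputs.
        self.register_buffer("default_vectors", torch.zeros(n_experts, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)              # (tokens, n_experts)
        topk_idx = probs.topk(self.top_k, dim=-1).indices      # active experts per token
        defaults = self.default_vectors.clone()                # snapshot so EMA updates don't disturb autograd

        # Dense "default" mixture: every expert contributes its cached EMA output,
        # weighted by its gate probability. This is the term that gives the router
        # dense gradients at negligible extra compute.
        out = probs @ defaults                                  # (tokens, d_model)

        for e, expert in enumerate(self.experts):
            token_mask = (topk_idx == e).any(dim=-1)
            if not token_mask.any():
                continue
            y = expert(x[token_mask])                           # sparse compute: active tokens only
            p = probs[token_mask, e].unsqueeze(-1)
            # Swap the default contribution for the real output on active tokens.
            out[token_mask] = out[token_mask] + p * (y - defaults[e])

            if self.training:                                   # EMA update, no gradient
                with torch.no_grad():
                    self.default_vectors[e].lerp_(y.mean(dim=0), 1 - self.ema_decay)
        return out
```

In this sketch the gate probabilities for inactive experts still appear in the loss through `probs @ defaults`, so the router is trained densely even though only the top-k experts run a forward pass.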
MoE Training