In MoE training, the router receives gradients only from the few experts it activates per token, so most of its routing decisions get no learning signal. Default MoE fixes this by replacing each inactive expert's output with an EMA (exponential moving average) of its past outputs (its default vector), giving the router dense gradients while keeping computation at top-k. Effects: faster convergence, improved perplexity/benchmarks, training stability↑, minimal memory·speed impact.
https://arxiv.org/pdf/2504.12463v3
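
A minimal PyTorch sketch of the idea (the layer/variable names, expert FFN shape, and the exact EMA update granularity are my assumptions, not the paper's reference implementation): every inactive expert contributes its gradient-free default vector to the mixture, so the router's softmax receives gradient for all experts, while only the top-k experts actually run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DefaultMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2, ema_decay: float = 0.99):
        super().__init__()
        self.top_k = top_k
        self.ema_decay = ema_decay
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # One "default vector" per expert: EMA of that expert's past outputs.
        self.register_buffer("default", torch.zeros(n_experts, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)        # (tokens, n_experts)
        _, topk_idx = gates.topk(self.top_k, dim=-1)     # active experts per token

        # Snapshot so the later in-place EMA update does not disturb autograd.
        defaults = self.default.clone()

        # Dense mixture of default vectors: every expert contributes, so the
        # router gets gradient signal for inactive experts at negligible cost.
        out = gates @ defaults                           # (tokens, d_model)

        for e, expert in enumerate(self.experts):
            idx = (topk_idx == e).any(dim=-1).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            expert_out = expert(x[idx])                  # real sparse computation
            w = gates[idx, e].unsqueeze(-1)
            # Swap this active expert's default contribution for its true
            # output (out-of-place index_add keeps autograd happy).
            out = out.index_add(0, idx, w * (expert_out - defaults[e]))

            if self.training:
                # EMA update of the default vector from the fresh outputs.
                with torch.no_grad():
                    self.default[e].lerp_(expert_out.mean(dim=0), 1 - self.ema_decay)
        return out
```

Because the default vectors are buffers with no gradient of their own, the only extra backward cost is the dense `gates @ defaults` term, which is how the router ends up with dense gradients at roughly top-k compute.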

Seonglae Cho