In MoE training, the router only receives gradients from the experts it activates, so it never gets a dense learning signal. Default MoE solves this by replacing each inactive expert's output with an EMA (exponential moving average) of its past outputs (a default vector), which gives the router dense gradients while keeping compute effectively constant, since the default vectors are cached rather than recomputed. Reported effects: faster convergence, improved perplexity and benchmark scores, more stable training, and minimal memory/speed overhead.
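A minimal PyTorch sketch of the idea, assuming one default vector per expert updated as an EMA of that expert's batch-mean output; the class and parameter names (`DefaultMoE`, `ema_decay`, etc.) are illustrative, not taken from a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DefaultMoE(nn.Module):
    """Sketch: inactive experts contribute a cached EMA 'default vector', so the
    router's softmax receives gradients for every expert while only the top-k
    experts are actually computed."""
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2, ema_decay: float = 0.99):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k
        self.ema_decay = ema_decay
        # One default vector per expert: an EMA of that expert's recent outputs.
        self.register_buffer("default_vectors", torch.zeros(n_experts, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)              # (tokens, n_experts)
        topk_idx = probs.topk(self.top_k, dim=-1).indices      # active experts per token
        defaults = self.default_vectors.clone()                # snapshot so EMA updates don't disturb autograd

        # Dense "default" mixture: every expert contributes its cached EMA output,
        # weighted by its gate probability. This is the term that gives the router
        # dense gradients at negligible extra compute.
        out = probs @ defaults                                  # (tokens, d_model)

        for e, expert in enumerate(self.experts):
            token_mask = (topk_idx == e).any(dim=-1)
            if not token_mask.any():
                continue
            y = expert(x[token_mask])                           # sparse compute: active tokens only
            p = probs[token_mask, e].unsqueeze(-1)
            # Swap the default contribution for the real output on active tokens.
            out[token_mask] = out[token_mask] + p * (y - defaults[e])

            if self.training:                                   # EMA update, no gradient
                with torch.no_grad():
                    self.default_vectors[e].lerp_(y.mean(dim=0), 1 - self.ema_decay)
        return out
```

In this sketch the gate probabilities for inactive experts still appear in the loss through `probs @ defaults`, so the router is trained densely even though only the top-k experts run a forward pass.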
MoE Training