MoE Training

Creator: Seonglae Cho
Created: 2025 Dec 22 23:35
Edited: 2025 Dec 22 23:36

In MoE training, the router only receives gradient signal from the few experts it activates for each token. Default MoE addresses this by replacing the outputs of inactive experts with an exponential moving average (EMA) of their past outputs (a "default vector"), which gives the router dense gradients while keeping per-token compute constant. Reported effects: faster convergence, improved perplexity and benchmark scores, greater training stability, and minimal memory/speed overhead.
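
A minimal PyTorch sketch of the default-vector idea described above, assuming a token-level top-k router; the class and buffer names (DefaultMoE, default_vec, ema_decay) are illustrative and not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DefaultMoE(nn.Module):
    """Top-k MoE layer where inactive experts contribute an EMA 'default vector'."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 2, ema_decay: float = 0.99):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k
        self.ema_decay = ema_decay
        # One default vector per expert: an EMA of that expert's past outputs.
        self.register_buffer("default_vec", torch.zeros(n_experts, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)         # (tokens, n_experts)
        _, topk_idx = gates.topk(self.top_k, dim=-1)

        # Dense baseline: every expert contributes its EMA output weighted by its
        # gate, so the router gets gradients for all experts (gates carry grad,
        # default_vec is a buffer and does not).
        out = gates @ self.default_vec                     # (tokens, d_model)

        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e).any(dim=-1)             # tokens routed to expert e
            if not mask.any():
                continue
            y = expert(x[mask])                            # real compute only for routed tokens
            # Swap the default contribution for the real output on routed tokens.
            out[mask] = out[mask] + gates[mask, e:e + 1] * (y - self.default_vec[e])
            # Update this expert's EMA of past outputs (no gradient through the buffer).
            with torch.no_grad():
                self.default_vec[e].mul_(self.ema_decay).add_(
                    (1 - self.ema_decay) * y.mean(dim=0))
        return out
```

Because the gate weights multiply every expert's contribution (real or default), the router receives a gradient for all experts each step, while expert FLOPs stay at top-k as in a standard sparse MoE.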