Traditional MoE uses token-choice routing (each token picks its top-k experts) → certain experts become overloaded while others are undertrained. Expert-choice routing flips this: each expert has a fixed capacity and, within that capacity, selects its most important tokens via top-k over the routing scores. This achieves perfect load balancing by construction. From the token side, a token can be mapped to 1~N experts as needed → a variable number of experts per token, not a fixed token-level top-k.
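A minimal PyTorch sketch of this routing step, assuming a single MoE layer with gating weights `w_gate` and a per-expert bucket size derived from a capacity factor (names and the `expert_choice_route` helper are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def expert_choice_route(x, w_gate, capacity_factor=2.0):
    """Sketch of expert-choice routing for one MoE layer.

    x:      [n_tokens, d_model] token representations
    w_gate: [d_model, n_experts] routing weights
    Each expert picks its own top-k tokens, so load is balanced by
    construction and a token may be selected by a variable number of experts.
    """
    n_tokens, _ = x.shape
    n_experts = w_gate.shape[1]
    # Fixed per-expert bucket size k = n_tokens * capacity_factor / n_experts
    k = max(1, int(n_tokens * capacity_factor / n_experts))

    # Token-to-expert affinity scores, softmax over the expert dimension
    scores = F.softmax(x @ w_gate, dim=-1)           # [n_tokens, n_experts]

    # Expert choice: each expert (row after transpose) takes its top-k tokens
    gates, idx = torch.topk(scores.t(), k, dim=-1)   # both [n_experts, k]

    # Gather the selected tokens for each expert's bucket
    dispatched = x[idx]                              # [n_experts, k, d_model]
    return dispatched, gates, idx
```

Expert outputs would then be weighted by `gates` and scattered back to token positions via `idx`; that combine step is omitted here for brevity.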
Compared to GShard/GLaM: training converges about 2x faster, and each step is also ~20% faster (no capacity overprovisioning needed). Fine-tuning improves by an average of +2% on GLUE/SuperGLUE. Expert choice eliminates load imbalance between experts, automatically adjusts the number of experts per token based on token difficulty, removes expert over-/under-training issues, and its perplexity advantage grows at larger scales.
Mixture-of-Experts with Expert Choice Routing

Seonglae Cho