Traditional MoE uses token-choice routing (each token picks its top-k experts) → certain experts become overloaded while others are undertrained. Expert-choice routing flips this: each expert has a fixed capacity and, within that capacity, selects its most important tokens via top-k over the routing scores. This achieves perfect load balancing by construction. From the token side, a token can be mapped to 1~N experts as needed → a variable number of experts per token, not a fixed token-level top-k.
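A minimal PyTorch sketch of this routing step, assuming a single MoE layer with gating weights `w_gate` and a per-expert bucket size derived from a capacity factor (names and the `expert_choice_route` helper are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def expert_choice_route(x, w_gate, capacity_factor=2.0):
    """Sketch of expert-choice routing for one MoE layer.

    x:      [n_tokens, d_model] token representations
    w_gate: [d_model, n_experts] routing weights
    Each expert picks its own top-k tokens, so load is balanced by
    construction and a token may be selected by a variable number of experts.
    """
    n_tokens, _ = x.shape
    n_experts = w_gate.shape[1]
    # Fixed per-expert bucket size k = n_tokens * capacity_factor / n_experts
    k = max(1, int(n_tokens * capacity_factor / n_experts))

    # Token-to-expert affinity scores, softmax over the expert dimension
    scores = F.softmax(x @ w_gate, dim=-1)           # [n_tokens, n_experts]

    # Expert choice: each expert (row after transpose) takes its top-k tokens
    gates, idx = torch.topk(scores.t(), k, dim=-1)   # both [n_experts, k]

    # Gather the selected tokens for each expert's bucket
    dispatched = x[idx]                              # [n_experts, k, d_model]
    return dispatched, gates, idx
```

Expert outputs would then be weighted by `gates` and scattered back to token positions via `idx`; that combine step is omitted here for brevity.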
Compared to GShard/GLaM: training converges about 2x faster, and each step is also ~20% faster (no capacity overprovisioning needed). Fine-tuning improves by an average of +2% on GLUE/SuperGLUE. Expert choice eliminates load imbalance between experts, automatically adjusts the number of experts per token based on token difficulty, removes expert over-/under-training issues, and its perplexity advantage grows at larger scales.
Mixture-of-Experts with Expert Choice Routing

Seonglae Cho