The biggest problem with Sparse MoE
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means...
https://arxiv.org/abs/2504.12463v3
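
To make the sparsity concrete, below is a minimal top-k routing sketch (illustrative only, not the paper's implementation; `SimpleTopKMoE` and all dimensions are made up). Each token runs only its top-k experts, so unselected experts never enter the computation graph and receive no gradient from that token, and the router only gets a learning signal through the experts it actually picked.

```python
# Minimal top-k MoE sketch (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTopKMoE(nn.Module):
    """Routes each token to its top-k experts; the rest are never evaluated."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)           # (num_tokens, num_experts)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)   # keep only k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # expert never runs -> no gradient reaches its parameters
            gate = topk_probs[token_ids, slot].unsqueeze(-1)
            # Only the selected experts' outputs enter the graph
            out = out.index_add(0, token_ids, gate * expert(x[token_ids]))
        return out

moe = SimpleTopKMoE(d_model=16, d_ff=32, num_experts=4, k=2)
moe(torch.randn(1, 16)).sum().backward()   # one token -> only 2 of 4 experts run
for e, expert in enumerate(moe.experts):
    print(f"expert {e} received gradient: {expert[0].weight.grad is not None}")
```

This sparse backward update is what the paper targets: its Default MoE substitutes the missing expert outputs with an exponential moving average of previously seen expert outputs, so the router receives a dense gradient while the experts stay sparsely activated.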


Seonglae Cho