Redesigns the computation graph to minimize activation storage during the backward pass, substantially reducing memory usage. Aligns the number of tokens routed to each expert with GPU GEMM tile sizes, eliminating padding operations and improving throughput. When training a 7B MoE model, overall throughput reaches 1.86× that of ScatterMoE, with token rounding contributing an additional ~16% improvement.
https://arxiv.org/pdf/2512.14080v1
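
As an illustration of the tile-alignment idea, the sketch below rounds per-expert token counts to multiples of an assumed GEMM tile size (128 here). The paper's actual rounding rule, tile dimensions, and kernel integration are not described in this note, so the function name and values are placeholders.

```python
# Minimal sketch of tile-aligned expert capacities (illustrative only;
# not the paper's implementation).
import torch

TILE_SIZE = 128  # assumed GEMM tile dimension; the real value depends on hardware/kernel

def tile_aligned_capacities(expert_counts: torch.Tensor, tile: int = TILE_SIZE) -> torch.Tensor:
    """Round each expert's token count to the nearest multiple of the GEMM tile.

    Grouped GEMMs process rows in fixed-size tiles, so a count that is not a
    tile multiple forces either zero-padding or a partially filled tile.
    Rounding the per-expert capacity removes the padding step entirely.
    """
    return torch.round(expert_counts / tile).clamp(min=1).long() * tile

# Example: token counts from a router with 4 experts
counts = torch.tensor([150, 301, 97, 512])
print(tile_aligned_capacities(counts))  # tensor([128, 256, 128, 512])
```

Rounding down drops overflow tokens (as in capacity-limited MoE routing), while rounding up keeps every token at the cost of a few dummy rows that still fill whole tiles; which direction the paper takes is not specified here.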

Seonglae Cho