SonicMoE
SonicMoE redesigns the MoE computation graph to minimize activation storage during the backward pass, significantly reducing memory usage, and it aligns the number of tokens assigned to each expert with GPU GEMM tile sizes to eliminate unnecessary padding and improve throughput. When training a 7B MoE model, throughput increases by 1.86× compared to ScatterMoE, and token rounding provides an additional ~16% improvement.
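The tile-alignment idea can be illustrated with a minimal sketch: each expert's token count is rounded to a multiple of the GEMM tile height so every expert GEMM covers only full tiles and no padding rows are needed. The tile size, the rounding direction (dropping remainder tokens), and the function name below are illustrative assumptions, not details taken from SonicMoE.

```python
# Minimal sketch of tile-aligned token rounding for MoE expert GEMMs.
# Assumptions (not from the SonicMoE source): TILE_M, rounding down,
# and the function name are placeholders for illustration only.

TILE_M = 128  # hypothetical GEMM tile height (token rows per tile)

def round_token_counts(tokens_per_expert, tile=TILE_M):
    """Round each expert's token count down to a multiple of the tile size,
    so each expert's GEMM consists of full tiles and needs no padded rows."""
    return [count - (count % tile) for count in tokens_per_expert]

if __name__ == "__main__":
    counts = [300, 129, 511, 64]          # tokens routed to four experts
    print(round_token_counts(counts))     # -> [256, 128, 384, 0]
```

Rounding down trades a small fraction of routed tokens for perfectly tile-aligned GEMMs; an implementation could equally round up and pad only to the nearest tile boundary, which is a design choice the sketch does not settle.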