Field Notes on Scaling MoE Expert Parallelism with DeepEP
Documenting the journey of scaling expert parallelism to achieve high-throughput pretraining.
https://nousresearch.com/moe-scaling-field-notes/

Notes:
- CPU launch overhead in scatter_add and dispatch accounts for over half of total time (see the profiling sketch below).
- Tuning num_sms from 24 → 128 improved bandwidth by more than 2x (see the tuning sketch below).
- Scaling is nearly linear up to 128 GPUs.
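
The launch-overhead observation can be reproduced in spirit with a PyTorch profiler run. This is a minimal, hypothetical micro-benchmark (the tensor shapes and loop count are invented here, not taken from the post): when each scatter_add kernel is short, the self CPU time spent launching kernels dominates the profile.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-in for the MoE combine step: routed expert outputs are
# scattered back into a dense token buffer with scatter_add.
tokens, hidden, routed = 8192, 2048, 16384
out = torch.zeros(tokens, hidden, device="cuda", dtype=torch.bfloat16)
src = torch.randn(routed, hidden, device="cuda", dtype=torch.bfloat16)
index = torch.randint(0, tokens, (routed, 1), device="cuda").expand(routed, hidden).contiguous()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        out.scatter_add_(0, index, src)
    torch.cuda.synchronize()

# When the kernels are short, self CPU time (launch overhead) dominates the table.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

The num_sms knob is a DeepEP setting: the library runs its dispatch/combine kernels on a reserved set of SMs, and the stock examples use 24. Below is a sketch of raising it to 128, assuming the Buffer.set_num_sms setter and the buffer-sizing pattern shown in DeepEP's README; the process-group setup and hidden size are placeholders, not values from the post.

```python
import torch.distributed as dist
from deep_ep import Buffer

# Assumed: torchrun has launched one process per GPU.
dist.init_process_group(backend="nccl")
group = dist.new_group(list(range(dist.get_world_size())))

# DeepEP runs its communication kernels on a fixed SM budget (default
# examples reserve 24). Raising it to 128 is the tuning described in the note.
Buffer.set_num_sms(128)

# Hypothetical model size: hidden dim 4096 in bf16 (2 bytes per element).
hidden_bytes = 4096 * 2

# Size the NVLink/RDMA buffers from the dispatch and combine configs,
# then allocate the communication buffer once and reuse it every step.
num_nvl_bytes, num_rdma_bytes = 0, 0
for config in (Buffer.get_dispatch_config(group.size()),
               Buffer.get_combine_config(group.size())):
    num_nvl_bytes = max(config.get_nvl_buffer_size_hint(hidden_bytes, group.size()), num_nvl_bytes)
    num_rdma_bytes = max(config.get_rdma_buffer_size_hint(hidden_bytes, group.size()), num_rdma_bytes)

buffer = Buffer(group, num_nvl_bytes, num_rdma_bytes)
```

Reserving more SMs for communication trades compute occupancy for dispatch/combine bandwidth, which is presumably the trade-off behind the 2x figure above.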


Seonglae Cho