Field Notes on Scaling MoE Expert Parallelism with DeepEP
Documenting the journey of scaling expert parallelism to achieve high-throughput pretraining.
https://nousresearch.com/moe-scaling-field-notes/

Notes:
- CPU launch overhead in scatter_add and dispatch accounts for over half of total time (see the profiling sketch below).
- Tuning num_sms from 24 → 128 improved bandwidth by more than 2x (see the tuning sketch below).
- Scaling is nearly linear up to 128 GPUs.
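
The launch-overhead observation can be reproduced in spirit with a PyTorch profiler run. This is a minimal, hypothetical micro-benchmark (the tensor shapes and loop count are invented here, not taken from the post): when each scatter_add kernel is short, the self CPU time spent launching kernels dominates the profile.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-in for the MoE combine step: routed expert outputs are
# scattered back into a dense token buffer with scatter_add.
tokens, hidden, routed = 8192, 2048, 16384
out = torch.zeros(tokens, hidden, device="cuda", dtype=torch.bfloat16)
src = torch.randn(routed, hidden, device="cuda", dtype=torch.bfloat16)
index = torch.randint(0, tokens, (routed, 1), device="cuda").expand(routed, hidden).contiguous()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        out.scatter_add_(0, index, src)
    torch.cuda.synchronize()

# When the kernels are short, self CPU time (launch overhead) dominates the table.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

The num_sms knob is a DeepEP setting: the library runs its dispatch/combine kernels on a reserved set of SMs, and the stock examples use 24. Below is a sketch of raising it to 128, assuming the Buffer.set_num_sms setter and the buffer-sizing pattern shown in DeepEP's README; the process-group setup and hidden size are placeholders, not values from the post.

```python
import torch.distributed as dist
from deep_ep import Buffer

# Assumed: torchrun has launched one process per GPU.
dist.init_process_group(backend="nccl")
group = dist.new_group(list(range(dist.get_world_size())))

# DeepEP runs its communication kernels on a fixed SM budget (default
# examples reserve 24). Raising it to 128 is the tuning described in the note.
Buffer.set_num_sms(128)

# Hypothetical model size: hidden dim 4096 in bf16 (2 bytes per element).
hidden_bytes = 4096 * 2

# Size the NVLink/RDMA buffers from the dispatch and combine configs,
# then allocate the communication buffer once and reuse it every step.
num_nvl_bytes, num_rdma_bytes = 0, 0
for config in (Buffer.get_dispatch_config(group.size()),
               Buffer.get_combine_config(group.size())):
    num_nvl_bytes = max(config.get_nvl_buffer_size_hint(hidden_bytes, group.size()), num_nvl_bytes)
    num_rdma_bytes = max(config.get_rdma_buffer_size_hint(hidden_bytes, group.size()), num_rdma_bytes)

buffer = Buffer(group, num_nvl_bytes, num_rdma_bytes)
```

Reserving more SMs for communication trades compute occupancy for dispatch/combine bandwidth, which is presumably the trade-off behind the 2x figure above.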


Seonglae Cho