When experts are distributed across multiple nodes, routing a single token to too many nodes drives up inter-node communication costs. To bound this cost, the experts selected for each token are restricted to span at most M nodes (servers or GPU clusters). Specifically, each node is scored by the sum of the affinity scores of the experts it hosts, the nodes are ranked by this score in descending order, and only the top M nodes are retained; the token's experts are then chosen from those nodes, so each token is dispatched to at most M nodes.
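The following is a minimal sketch of this node-limited selection for a single token, assuming a uniform layout of experts_per_node experts per node and scoring each node by the plain sum of its experts' affinities, as described above. The function name node_limited_topk and its parameters are hypothetical, not taken from any particular implementation.

```python
import numpy as np

def node_limited_topk(affinity, experts_per_node, m_nodes, k_experts):
    """Pick top-k experts for one token, restricted to at most m_nodes nodes.

    affinity: (num_experts,) routing scores for this token (hypothetical input).
    experts_per_node: experts hosted per node (assumed uniform layout).
    """
    num_experts = affinity.shape[0]
    num_nodes = num_experts // experts_per_node

    # Score each node by the sum of affinities of the experts it hosts.
    node_scores = affinity.reshape(num_nodes, experts_per_node).sum(axis=1)

    # Keep only the M highest-scoring nodes.
    top_nodes = np.argsort(node_scores)[::-1][:m_nodes]

    # Mask out experts on all other nodes, then take the usual top-k.
    mask = np.full(num_experts, -np.inf)
    for n in top_nodes:
        mask[n * experts_per_node:(n + 1) * experts_per_node] = 0.0
    return np.argsort(affinity + mask)[::-1][:k_experts]

# Example: 8 experts spread over 4 nodes, route to top-4 experts on at most 2 nodes.
scores = np.array([0.9, 0.1, 0.8, 0.7, 0.2, 0.3, 0.6, 0.05])
print(node_limited_topk(scores, experts_per_node=2, m_nodes=2, k_experts=4))
```

Because all selected experts live on at most M nodes, the dispatch for each token touches a bounded set of nodes regardless of how the top-k scores are spread across the cluster.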
This approach significantly reduces inter-node communication costs during distributed training, allowing computation and communication to be almost completely overlapped.