DualPipe overlaps the computation and communication within a pair of individual forward and backward chunks.
Each chunk is divided into four components: attention, all-to-all dispatch, MLP, and all-to-all combine.
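The overlap can be pictured as two concurrent streams: one for computation (attention, MLP) and one for all-to-all communication (dispatch, combine). Below is a minimal sketch, not DeepSeek's implementation, of one possible interleaving of a forward (F) and backward (B) chunk pair that respects the data dependencies, assuming a forward chunk runs attention → dispatch → MLP → combine and a backward chunk traverses those components in reverse; the `Step` type and labels are illustrative placeholders.

```python
from dataclasses import dataclass

@dataclass
class Step:
    compute: str  # component running on the computation stream
    comm: str     # all-to-all running on the communication stream

def overlapped_pair_schedule():
    """One possible interleaving of a forward (F) and backward (B) chunk:
    while one chunk is computing, the other chunk's all-to-all is in flight."""
    return [
        Step(compute="attention(F)", comm="combine(B)"),
        Step(compute="MLP(B)",       comm="dispatch(F)"),
        Step(compute="MLP(F)",       comm="dispatch(B)"),
        Step(compute="attention(B)", comm="combine(F)"),
    ]

if __name__ == "__main__":
    for i, step in enumerate(overlapped_pair_schedule()):
        print(f"step {i}: compute={step.compute:<13} comm={step.comm}")
```

At every step one chunk's all-to-all is hidden behind the other chunk's computation, which is the pairing the section describes.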
DualPipe is an algorithm for efficient pipeline parallelism: it has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model scales up further, we can still employ fine-grained experts across nodes while achieving near-zero all-to-all communication overhead, as long as we maintain a constant computation-to-communication ratio.
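To see why a sufficient computation-to-communication ratio is enough to hide the all-to-all, consider a simple timing model (a sketch with hypothetical, unmeasured component times, not a profile of the real system): if computation and communication of a chunk pair run on separate streams, the pair finishes when the slower stream finishes, so the all-to-all adds no extra time whenever there is at least as much computation to overlap it with.

```python
def split(chunk):
    """Separate a chunk's computation time from its all-to-all time."""
    compute = chunk["attention"] + chunk["mlp"]
    comm = chunk["dispatch"] + chunk["combine"]
    return compute, comm

def pair_time_serial(fwd, bwd):
    """All eight components of a forward/backward chunk pair run back to back."""
    return sum(fwd.values()) + sum(bwd.values())

def pair_time_overlapped(fwd, bwd):
    """Computation and all-to-all run concurrently; the pair takes as long as
    the slower of the two streams, so communication is fully hidden whenever
    total compute time >= total communication time."""
    f_compute, f_comm = split(fwd)
    b_compute, b_comm = split(bwd)
    return max(f_compute + b_compute, f_comm + b_comm)

if __name__ == "__main__":
    # hypothetical per-component times in arbitrary units
    fwd = {"attention": 10, "dispatch": 6, "mlp": 12, "combine": 7}
    bwd = {"attention": 20, "dispatch": 6, "mlp": 24, "combine": 7}
    print("serial     :", pair_time_serial(fwd, bwd))      # 92
    print("overlapped :", pair_time_overlapped(fwd, bwd))  # 66: all-to-all (26) fully hidden
```

In this toy setting the computation-to-communication ratio is 66/26 ≈ 2.5, so the all-to-all contributes nothing to the overlapped pair time; the same reasoning holds at larger scale as long as that ratio is maintained.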
Properties
- DualPipe not only accelerates model training by effectively overlapping forward and backward computation-communication phases, but also reduces pipeline bubbles.
- Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training.