Data parallelism distributes training data across multiple GPUs and generally works out of the box, provided the data loading pipeline is set up correctly.
The main goal is to increase the number of batches that can be processed simultaneously. However, this approach faces two main challenges: keeping compute efficient despite the communication needed for gradient all-reduce, and maintaining training efficiency as the effective batch size grows.
The process works in three main steps (sketched in code after the list):
- Distribution of input data and model parameters to each GPU
- During backward propagation, gradients are computed separately for data on each GPU
- Finally, all gradients are collected and consolidated for model updates
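Here is a minimal sketch of these three steps using raw `torch.distributed` collectives (the mechanics that DDP automates). It assumes a `torchrun` launch with one process per GPU; the model, data, and optimizer are stand-ins for illustration.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
world_size = dist.get_world_size()

model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Step 1: every rank starts from identical parameters; each rank loads
# only its own shard of the data (e.g. via a DistributedSampler).
for p in model.parameters():
    dist.broadcast(p.data, src=0)
x = torch.randn(32, 512, device="cuda")  # this rank's local micro-batch
y = torch.randn(32, 512, device="cuda")

# Step 2: the backward pass computes gradients on the local shard only.
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Step 3: all-reduce averages gradients so every replica applies the same update.
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world_size

optimizer.step()
optimizer.zero_grad()
```

DDP performs these same steps automatically and overlaps the all-reduce with the backward pass.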
Due to this communication overhead, traditional Data Parallelism (DP) doesn't scale linearly with the number of GPUs. Typically, one GPU acts as a control tower that scatters inputs and gathers outputs and gradients, which creates a bottleneck. Various approaches have emerged to address this limitation.
Distributed Data Parallel (DDP) is consistently superior to standard DP, while the choice between DDP and Fully Sharded Data Parallel (FSDP) depends on specific use cases and their trade-offs.
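The sketch below shows how each wrapper is applied in PyTorch, assuming a `torchrun` launch with one process per GPU; `build_model` is a placeholder for your own model, and in practice you would pick exactly one of the three wrappers.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_model():
    return nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Legacy DP: a single process where one "control tower" GPU scatters inputs
# and gathers outputs/gradients. No process group is needed.
dp_model = nn.DataParallel(build_model().cuda())

# DDP and FSDP: one process per GPU, so a process group is required.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# DDP: a full model replica on every GPU; gradients are averaged with
# all-reduce overlapped with the backward pass.
ddp_model = DDP(build_model().cuda(), device_ids=[local_rank])

# FSDP: parameters, gradients, and optimizer state are sharded across GPUs,
# trading extra communication for a much smaller per-GPU memory footprint.
fsdp_model = FSDP(build_model().cuda())
```

As a rule of thumb, DDP is preferred when the full model and optimizer state fit comfortably in a single GPU's memory, while FSDP becomes attractive once they no longer do.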
Limitation
Because internal states and parameter gradients must be exchanged at every single gradient step, all devices need to be co-located and connected by low-latency, high-bandwidth communication links.
DiLoCo relaxes this co-location constraint: accelerators are grouped into “workers”, and synchronization between workers occurs only infrequently.
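A rough sketch of one DiLoCo-style round is shown below, assuming one process per worker in a `torch.distributed` group. The function and variable names (`diloco_round`, `data_iter`, `outer_params`) are illustrative; `outer_params` is a separate copy of the model parameters that holds the globally agreed state between rounds.

```python
import torch
import torch.distributed as dist

def diloco_round(model, data_iter, inner_steps, inner_opt, outer_opt, outer_params):
    # Snapshot of the globally agreed parameters at the start of the round.
    start = [p.detach().clone() for p in model.parameters()]

    # Inner phase: many local steps with no cross-worker communication.
    for _ in range(inner_steps):
        x, y = next(data_iter)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

    # Outer phase: workers exchange only parameter deltas, once per round.
    for p_start, p_now, p_outer in zip(start, model.parameters(), outer_params):
        delta = p_start - p_now.detach()
        dist.all_reduce(delta, op=dist.ReduceOp.SUM)
        delta /= dist.get_world_size()
        p_outer.grad = delta  # the averaged delta acts as a pseudo-gradient
    outer_opt.step()
    outer_opt.zero_grad()

    # Every worker resumes the next round from the updated outer parameters.
    with torch.no_grad():
        for p_model, p_outer in zip(model.parameters(), outer_params):
            p_model.copy_(p_outer)
```

In the paper's reported setup, the inner optimizer is AdamW, the outer optimizer is SGD with Nesterov momentum, and workers run on the order of hundreds of inner steps per round, so cross-worker communication happens orders of magnitude less often than in DDP.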
Data Parallelism Usage