Distributed Low-Communication
Streaming DiLoCo by DeepMind
- Synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth
- Allow workers to continue training while synchronizing, which decreases wall-clock time
- Quantize the data exchanged by workers, which further reduces bandwidth across workers
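
The first and third points can be illustrated with a small sketch. This is not the DeepMind implementation: the function names (`sync_schedule`, `peak_and_total`, `quantize`, `dequantize`), the evenly staggered offsets, and the uniform 4-bit integer quantizer are all assumptions chosen for illustration. Overlapping communication with training (the second point) is not modeled here.

```python
# Hypothetical sketch: staggered fragment synchronization and simple quantization.
# All names and parameter values are illustrative assumptions.

def sync_schedule(num_fragments, sync_every, num_steps):
    """For each step 1..num_steps, list the fragment indices that synchronize.

    Assumption: fragment p is offset by p * sync_every // num_fragments steps,
    so fragments take turns instead of all synchronizing on the same step.
    """
    offsets = [p * sync_every // num_fragments for p in range(num_fragments)]
    return [
        [p for p, off in enumerate(offsets) if step % sync_every == off]
        for step in range(1, num_steps + 1)
    ]


def peak_and_total(schedule, fragment_sizes):
    """Return (largest number of parameters sent in one step, total sent)."""
    per_step = [sum(fragment_sizes[p] for p in due) for due in schedule]
    return max(per_step), sum(per_step)


def quantize(values, bits=4):
    """Uniform quantization to 2**bits levels (a stand-in for a real low-precision format)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


# Same model (1000 parameters), synchronized every 100 steps, for 200 steps.
streaming = peak_and_total(sync_schedule(4, 100, 200), [250, 250, 250, 250])
all_at_once = peak_and_total(sync_schedule(1, 100, 200), [1000])
print("streaming   (peak, total):", streaming)
print("all at once (peak, total):", all_at_once)

# Quantize a small, made-up vector of parameter deltas before "sending" it.
deltas = [0.12, -0.40, 0.03, 0.27, -0.05]
codes, lo, scale = quantize(deltas, bits=4)
print("4-bit codes:", codes)
print("restored:   ", dequantize(codes, lo, scale))
```

In this toy setup both schedules send the same total number of parameters, but the staggered schedule sends at most one fragment per step, which is where the lower peak bandwidth comes from. Quantization then shrinks each exchanged value to a few bits at the cost of a bounded rounding error.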