Google DeepMind has released a new paper on distributed training, 'Streaming DiLoCo with Overlapping Communication: Towards a Distributed Free Lunch.' The work trains data-parallel models across globally distributed devices while using far less bandwidth, exchanging roughly 400 times less data than fully synchronous training. Key innovations include streaming partial updates, which synchronize only a subset of the gradients at a time, and overlapping communication with computation, so synchronization proceeds while the inner optimization steps are still running. The approach aims to lower the barrier for communities that want to train their own models, making distributed training more accessible.
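For intuition, here is a minimal toy sketch of the streaming idea (illustrative assumptions only: the fragment schedule, the plain SGD outer step, and all hyperparameters are made up, and a numpy mean stands in for the cross-worker all-reduce). Each worker runs H inner steps locally, and at the end of a round only one parameter fragment's pseudo-gradient is averaged and applied, so each round exchanges a fraction of the model.

```python
# Toy sketch of streaming partial synchronization (illustrative, not the paper's code):
# workers keep local replicas, run H inner steps per round, then sync only ONE
# parameter fragment per round via an averaged pseudo-gradient and an outer SGD step.
import numpy as np

rng = np.random.default_rng(0)
DIM, WORKERS, FRAGMENTS = 8, 4, 4
H, INNER_LR, OUTER_LR = 10, 0.1, 0.7

def local_grad(params, worker_id):
    # Stand-in for a worker's data-dependent gradient (toy quadratic loss).
    return params - np.full(DIM, float(worker_id))

global_params = rng.normal(size=DIM)                 # shared outer parameters
replicas = [np.copy(global_params) for _ in range(WORKERS)]

for round_idx in range(12):
    # Inner phase: purely local optimization, no communication at all.
    for w in range(WORKERS):
        for _ in range(H):
            replicas[w] -= INNER_LR * local_grad(replicas[w], w)

    # Streaming sync: only one fragment of the model is exchanged this round,
    # so per-round traffic is roughly 1/FRAGMENTS of a full synchronization.
    frag = round_idx % FRAGMENTS
    lo, hi = frag * DIM // FRAGMENTS, (frag + 1) * DIM // FRAGMENTS
    pseudo_grads = [global_params[lo:hi] - r[lo:hi] for r in replicas]
    avg_pg = np.mean(pseudo_grads, axis=0)           # stands in for an all-reduce

    # Outer step on the shared fragment, then broadcast it back to every replica.
    global_params[lo:hi] -= OUTER_LR * avg_pg
    for r in replicas:
        r[lo:hi] = global_params[lo:hi]

print("final shared params:", np.round(global_params, 3))
```

The paper's method additionally overlaps each fragment's exchange with subsequent inner steps rather than blocking on it; the loop above only illustrates why the per-round traffic shrinks.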
[CL] Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch A Douillard, Y Donchev, K Rush, S Kale... [Google DeepMind] (2025) https://t.co/KDpLjtQPel https://t.co/rOKe4dWAwm
🏷️:Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch 🔗:https://t.co/ouVsa5fBwM https://t.co/uZmfmHCf1K
great new DiLoCo work by @Ar_Douillard et al. Key points:
- Streaming partial updates: sync only a subset of gradients at a time
- Overlapping communication with computation: synchronizing while computing inner steps
- Pseudo-gradient quantization: as used in INTELLECT-1,… https://t.co/aXcV0doQS3
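On the last point, a rough sketch of what pseudo-gradient quantization can look like is below; the symmetric int8 scheme is a generic illustration, not necessarily the exact low-bit format used in the paper or in INTELLECT-1. Compressing the pseudo-gradient before it is exchanged multiplies with the streaming and overlapping savings above.

```python
# Generic pseudo-gradient quantization sketch (illustrative; the actual low-bit
# format may differ): quantize to int8 with a per-tensor scale before
# communication, dequantize after the all-reduce.
import numpy as np

def quantize_int8(x):
    scale = np.max(np.abs(x)) / 127.0 + 1e-12        # per-tensor symmetric scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

pseudo_grad = np.random.default_rng(1).normal(size=4096).astype(np.float32)
q, scale = quantize_int8(pseudo_grad)                # 4x smaller payload than float32
restored = dequantize_int8(q, scale)
print("max abs quantization error:", np.max(np.abs(restored - pseudo_grad)))
```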