NVIDIA Collective Communication Library
Windows is not supported
Cross-GPU tensor communication
Almost every CUDA-based multi-GPU training/inference server uses NCCL
NCCL_SOCKET_IFNAME
NCCL_DEBUG
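A minimal sketch of pinning these two variables from Python before the process group is created; the interface name `eth0` is an assumption and should be replaced with the actual NIC on the host.

```python
import os

# Assumption: eth0 is the NIC NCCL should use; replace with your interface name.
# These must be set before the NCCL communicator is created
# (i.e., before torch.distributed.init_process_group).
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
os.environ["NCCL_DEBUG"] = "INFO"  # VERSION / WARN / INFO / TRACE
```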
Using torch
Install
NCCL Installation Guide
This NVIDIA Collective Communication Library (NCCL) Installation Guide provides step-by-step instructions for downloading and installing NCCL.
https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
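A quick way to check which NCCL a CUDA build of PyTorch is linked against after installation; both calls exist in current PyTorch, though the exact return format of the version call varies by release.

```python
import torch
import torch.distributed as dist

# NCCL ships bundled with CUDA builds of PyTorch; these calls report what is linked in.
print(torch.cuda.is_available())      # CUDA must be usable first
print(dist.is_nccl_available())       # True if the NCCL backend was compiled in
print(torch.cuda.nccl.version())      # e.g. (2, 19, 3); format varies by version
```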
Env
Environment Variables — NCCL 2.20.3 documentation
NCCL has an extensive set of environment variables to tune for specific usage.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
NVIDIA Collective Communications Library (NCCL)
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter as well as point-to-point send and receive that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox Network across nodes.
https://developer.nvidia.com/nccl
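These primitives are exposed through `torch.distributed` when the `nccl` backend is selected. A minimal single-node sketch, assuming a `torchrun --nproc_per_node=<num_gpus>` launch (which sets `LOCAL_RANK` and the rendezvous variables):

```python
import os
import torch
import torch.distributed as dist

# Assumption: launched with  torchrun --nproc_per_node=<num_gpus> this_file.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(4, device="cuda") * dist.get_rank()
dist.all_reduce(x, op=dist.ReduceOp.SUM)  # NCCL all-reduce under the hood
dist.broadcast(x, src=0)                  # NCCL broadcast
dist.barrier()
dist.destroy_process_group()
```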

Operations — NCCL 2.6.4 documentation
Like MPI collective operations, NCCL collective operations have to be called for each rank (hence CUDA device) to form a complete collective operation. Failure to do so will result in other ranks waiting indefinitely.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html
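A sketch of the failure mode described above: if any rank skips the collective, the other ranks block on it until the watchdog timeout. Ranks are spawned on a single node with `torch.multiprocessing` for illustration; the address and port are arbitrary example values.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Assumption: single node; MASTER_ADDR/MASTER_PORT are illustrative values.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    t = torch.full((2,), float(rank), device="cuda")
    # Every rank must reach this line; if one rank returned early,
    # the remaining ranks would wait here indefinitely (until timeout).
    dist.all_reduce(t)
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```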
Timeout
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels
I just found this GH Issue: huggingface/accelerate#223 and it seems that we can add a timeout argument to the Accelerator constructor (default is 1800 seconds). When you load or tokenize a large dataset for the first time, NCCL may time out. HuggingFace caches tokenization, so when you train on the same dataset and tokenizer again, you shouldn't face the issue.
https://discuss.huggingface.co/t/some-nccl-operations-have-failed-or-timed-out-due-to-the-asynchronous-nature-of-cuda-kernels/26877/4
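A hedged sketch of the workaround from that thread, raising the process-group timeout through Accelerate's `InitProcessGroupKwargs`; the 3-hour value is an arbitrary example, the default being 1800 seconds.

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise NCCL's collective timeout so a long first-time tokenization pass on
# rank 0 does not trip the watchdog on the waiting ranks.
kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=3))  # example value
accelerator = Accelerator(kwargs_handlers=[kwargs])
```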


Seonglae Cho