NCCL

Creator: Seonglae Cho
Created: 2023 May 30 8:10
Edited: 2024 Jul 7 2:05

NVIDIA Collective Communication Library

Windows is not supported

Cross-GPU tensor communication

Almost every CUDA-based multi-GPU training/inference server uses NCCL. Two commonly used environment variables (see the sketch below):
  • NCCL_SOCKET_IFNAME — restricts which network interface(s) NCCL uses for inter-node communication (e.g. eth0)
  • NCCL_DEBUG — sets NCCL's log verbosity (VERSION, WARN, INFO, TRACE)
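A minimal sketch of setting these variables from Python before the process group is created; the interface name eth0 and the INFO level are placeholder choices, not recommendations.

```python
import os

# Restrict NCCL to a specific network interface (placeholder name).
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
# Log NCCL version and runtime information (WARN/INFO/TRACE raise verbosity).
os.environ["NCCL_DEBUG"] = "INFO"
```

These must be set (or exported in the shell) before NCCL initializes, i.e. before the process group is created, for them to take effect.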

Using torch
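On Linux, the CUDA builds of PyTorch typically bundle NCCL, so no separate install is usually needed to use the nccl backend. A quick check of what the installed build ships with:

```python
import torch
import torch.distributed as dist

print(dist.is_nccl_available())   # True when the build includes NCCL support
print(torch.cuda.nccl.version())  # bundled NCCL version, e.g. (2, 20, 5)
```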


Install

Abstract
This NVIDIA Collective Communication Library (NCCL) Installation Guide provides step-by-step instructions for downloading and installing NCCL.

Env

Environment Variables — NCCL 2.20.3 documentation
NCCL has an extensive set of environment variables to tune for specific usage.
NVIDIA Collective Communications Library (NCCL)
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter as well as point-to-point send and receive that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox Network across nodes.
Operations — NCCL 2.6.4 documentation
Like MPI collective operations, NCCL collective operations have to be called for each rank (hence CUDA device) to form a complete collective operation. Failure to do so will result in other ranks waiting indefinitely.
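A minimal sketch of this rule with torch's NCCL backend, assuming a launch like `torchrun --nproc_per_node=<num_gpus> script.py` so that every rank runs the same script; if any rank skips the all_reduce, the other ranks block on it.

```python
import os
import torch
import torch.distributed as dist

# One process per GPU; torchrun sets LOCAL_RANK for each of them.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Every rank must issue the same collective call.
x = torch.ones(4, device="cuda") * dist.get_rank()
dist.all_reduce(x, op=dist.ReduceOp.SUM)  # x now holds the sum over all ranks

dist.destroy_process_group()
```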

Timeout

Typical error message: “Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels …”
The GitHub issue huggingface/accelerate#223 points out that a timeout argument can be passed when constructing the Accelerator (the default is 1800 seconds). When you load or tokenize a large dataset for the first time, NCCL may time out. Hugging Face caches tokenization, so when you train on the same dataset and tokenizer again, you shouldn't face the issue.
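A sketch of raising that timeout through accelerate's InitProcessGroupKwargs handler; the 7200-second value is an arbitrary illustration, not a recommendation.

```python
from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Allow collectives to wait longer than the 1800 s default, e.g. while a
# large dataset is loaded or tokenized for the first time.
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```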

Recommendations