NVIDIA Collective Communication Library
Windows is not supported
Cross-GPU tensor communication
Almost every CUDA-based multi-GPU training/inference server uses NCCL
NCCL_SOCKET_IFNAME
NCCL_DEBUG
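A minimal sketch of pinning these two variables from Python before the process group is created; the interface name `eth0` is an assumption and should be replaced with the actual NIC on the host.

```python
import os

# Assumption: eth0 is the NIC NCCL should use; replace with your interface name.
# These must be set before the NCCL communicator is created
# (i.e., before torch.distributed.init_process_group).
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
os.environ["NCCL_DEBUG"] = "INFO"  # VERSION / WARN / INFO / TRACE
```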
Using torch
Install
NCCL Installation Guide
This NVIDIA Collective Communication Library (NCCL) Installation Guide provides step-by-step instructions for downloading and installing NCCL.
https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
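A quick way to check which NCCL a CUDA build of PyTorch is linked against after installation; both calls exist in current PyTorch, though the exact return format of the version call varies by release.

```python
import torch
import torch.distributed as dist

# NCCL ships bundled with CUDA builds of PyTorch; these calls report what is linked in.
print(torch.cuda.is_available())      # CUDA must be usable first
print(dist.is_nccl_available())       # True if the NCCL backend was compiled in
print(torch.cuda.nccl.version())      # e.g. (2, 19, 3); format varies by version
```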
Env
Environment Variables — NCCL 2.20.3 documentation
NCCL has an extensive set of environment variables to tune for specific usage.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
NVIDIA Collective Communications Library (NCCL)
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter as well as point-to-point send and receive that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox Network across nodes.
https://developer.nvidia.com/nccl
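These primitives are exposed through `torch.distributed` when the `nccl` backend is selected. A minimal single-node sketch, assuming a `torchrun --nproc_per_node=<num_gpus>` launch (which sets `LOCAL_RANK` and the rendezvous variables):

```python
import os
import torch
import torch.distributed as dist

# Assumption: launched with  torchrun --nproc_per_node=<num_gpus> this_file.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(4, device="cuda") * dist.get_rank()
dist.all_reduce(x, op=dist.ReduceOp.SUM)  # NCCL all-reduce under the hood
dist.broadcast(x, src=0)                  # NCCL broadcast
dist.barrier()
dist.destroy_process_group()
```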

Operations — NCCL 2.6.4 documentation
Like MPI collective operations, NCCL collective operations have to be called for each rank (hence CUDA device) to form a complete collective operation. Failure to do so will result in other ranks waiting indefinitely.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html
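A sketch of the failure mode described above: if any rank skips the collective, the other ranks block on it until the watchdog timeout. Ranks are spawned on a single node with `torch.multiprocessing` for illustration; the address and port are arbitrary example values.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Assumption: single node; MASTER_ADDR/MASTER_PORT are illustrative values.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    t = torch.full((2,), float(rank), device="cuda")
    # Every rank must reach this line; if one rank returned early,
    # the remaining ranks would wait here indefinitely (until timeout).
    dist.all_reduce(t)
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```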
Timeout
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels
I just found this GH Issue: huggingface/accelerate#223 and it seems that we can add a timeout argument to the Accelerator constructor (default is 1800 seconds). When you load or tokenize a large dataset for the first time, NCCL may time out. HuggingFace caches tokenization, so when you train on the same dataset and tokenizer again, you shouldn't face the issue.
https://discuss.huggingface.co/t/some-nccl-operations-have-failed-or-timed-out-due-to-the-asynchronous-nature-of-cuda-kernels/26877/4
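A hedged sketch of the workaround from that thread, raising the process-group timeout through Accelerate's `InitProcessGroupKwargs`; the 3-hour value is an arbitrary example, the default being 1800 seconds.

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise NCCL's collective timeout so a long first-time tokenization pass on
# rank 0 does not trip the watchdog on the waiting ranks.
kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=3))  # example value
accelerator = Accelerator(kwargs_handlers=[kwargs])
```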


Seonglae Cho