Distributed Data Parallel
Each GPU holds a full replica of the parameters, gradients, and optimizer states
- A sampler (DistributedSampler) sends a different shard of the data to each GPU
- Each GPU computes gradients of the model parameters on its own data
- An All-Reduce operation averages the gradients and distributes the result to every GPU
- Each GPU then applies the optimizer step, updating its model parameters
Since every replica applies the same averaged gradients, all GPUs are guaranteed to hold identical model parameters
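The cycle above can be sketched in plain Python without torch (all names here are hypothetical, for illustration only): each "GPU" keeps a full parameter replica, receives a disjoint data shard, computes local gradients, and an all-reduce average keeps every replica identical after the shared optimizer step.

```python
# Toy DDP update cycle: 4 simulated GPUs, a 2-parameter model, and a
# squared-error toy loss. Gradient of 0.5*(p - t)^2 w.r.t. p is (p - t).

def local_gradient(params, batch):
    # Step 2: each GPU computes gradients on its own data shard.
    return [sum(p - t for t in batch) / len(batch) for p in params]

def all_reduce_mean(grads_per_gpu):
    # Step 3: element-wise average across GPUs; every GPU gets the result.
    n = len(grads_per_gpu)
    dim = len(grads_per_gpu[0])
    return [sum(g[i] for g in grads_per_gpu) / n for i in range(dim)]

def ddp_step(replicas, shards, lr=0.1):
    grads = [local_gradient(p, s) for p, s in zip(replicas, shards)]
    avg = all_reduce_mean(grads)
    # Step 4: identical optimizer step on every GPU (plain SGD here).
    return [[p - lr * g for p, g in zip(rep, avg)] for rep in replicas]

replicas = [[1.0, -2.0] for _ in range(4)]   # 4 GPUs, same initialization
shards = [[0.5], [1.5], [-0.5], [2.5]]       # sampler: disjoint shards
replicas = ddp_step(replicas, shards)
# All replicas remain bit-identical after the step.
assert all(r == replicas[0] for r in replicas)
```

Because every replica sees the same averaged gradient, no parameter broadcast is needed after the first step; only gradients travel over the network.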
Types
- Multi node
- Multi GPU
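For the multi-node, multi-GPU case, PyTorch's `torchrun` launcher spawns one process per GPU and sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` for each. A minimal sketch (script name, node count, and endpoint are placeholder values):

```shell
# On each of 2 nodes with 8 GPUs, adjusting --node_rank per node (0 or 1):
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:29500 \
  train.py
```

Inside `train.py`, the script calls `torch.distributed.init_process_group()` and wraps the model in `DistributedDataParallel` using the rank variables set by the launcher.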
From PyTorch DDP to Accelerate to Trainer, mastery of distributed training with ease
https://huggingface.co/blog/pytorch-ddp-accelerate-transformers
Distributed Data Parallel in PyTorch Tutorial Series
Suraj Subramanian breaks down why distributed training is an important part of your ML arsenal; the series starts with a simple non-distributed training job.
https://www.youtube.com/playlist?list=PL_lsbAsL_o2CSuhUhJIiW0IkdT5C2wGWj


Seonglae Cho