A wrap policy is the policy that decides how FSDP splits a model into units to shard.
PyTorch ships with a built-in auto-wrap policy for transformer models.
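A minimal sketch of how that policy is applied, following the pattern from the FSDP tutorial linked below; the toy `Block` module and the single-process gloo setup are illustrative assumptions, not part of the original note:

```python
# Sketch: wrapping a transformer with FSDP's built-in transformer_auto_wrap_policy.
# Each `Block` instance becomes its own FSDP unit, so its parameters are
# all-gathered and freed together during forward/backward.
import functools
import os

import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class Block(nn.Module):
    """Toy stand-in for a transformer block (e.g. GPT2Block in practice)."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        return x + self.mlp(x + attn_out)


def main():
    # Single-process process group purely for illustration; real runs use torchrun.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.Sequential(*[Block() for _ in range(4)])

    # The wrap policy determines the sharding granularity: here, one FSDP unit
    # per transformer layer class listed in `transformer_layer_cls`.
    policy = functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
    fsdp_model = FSDP(model, auto_wrap_policy=policy)
    print(fsdp_model)  # shows nested FSDP units, one per Block

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```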
Getting Started with Fully Sharded Data Parallel (FSDP) — PyTorch Tutorials
Training AI models at a large scale is a challenging task that requires a lot of compute power and resources.
It also comes with considerable engineering complexity to handle the training of these very large models. PyTorch FSDP, released in PyTorch 1.11, makes this easier.
https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
FSDP2
Supercharging Training using float8 and FSDP2
IBM: Tuan Hoang Trong, Alexei Karve, Yan Koyfman, Linsong Chu, Divya Kumari, Shweta Salaria, Robert Walkup, Praneet Adusumilli, Nirmit Desai, Raghu Ganti, Seetharami Seelam; Meta: Less Wright, Wei Feng, Vasiliy Kuznetsov, Driss Guesseous
https://pytorch.org/blog/training-using-float8-fsdp2/


Seonglae Cho