Torch Elastic Launch
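On the first node (node_rank=0):
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py
On the second node (node_rank=1):
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py
The two commands differ only in --node_rank. 123.456.123.456 is a placeholder, not a valid IPv4 address; replace it with an address of the master node that both machines can reach.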
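On the script side, torchrun passes each worker its placement through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). A minimal sketch of how train.py might read them is below. The names DistEnv and read_dist_env are hypothetical helpers, not part of PyTorch, and the fallback values are assumptions chosen so the script can also run without torchrun.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Placement of one worker process (hypothetical helper, not a PyTorch API)."""
    rank: int         # global rank across all nodes
    local_rank: int   # rank within this node, typically used to pick the GPU
    world_size: int   # total number of worker processes
    master_addr: str
    master_port: int


def read_dist_env(env=None) -> DistEnv:
    """Read the variables torchrun exports to each worker.

    The fallbacks (single process on localhost) are assumptions so that the
    script still works when started with plain `python train.py`.
    """
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", 29500)),
    )


if __name__ == "__main__":
    # Under the launch commands above, WORLD_SIZE would be 8 * 2 = 16.
    print(read_dist_env())
```

In a real training script these values would usually feed torch.cuda.set_device(local_rank) and torch.distributed.init_process_group, which reads the same variables by default.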
torchrun (Elastic Launch) — PyTorch 2.0 documentation
https://pytorch.org/docs/stable/elastic/run.html

Seonglae Cho