torchft — pytorch/torchft main documentation
This repository implements primitives and end-to-end (E2E) solutions for
per-step fault tolerance, so you can keep training when errors occur without
interrupting the entire training job.
https://pytorch-labs.github.io/torchft/