torchft

Creator

Created

2025 Jan 9 13:49

Editor

Edited

2025 Jan 9 13:50

Refs

pytorch-labs • Updated 2025 Jan 9 13:18

PyTorch per step Fault Tolerance

torchft — pytorch/torchft main documentation

This repository implements primitives and end-to-end (E2E) solutions for per-step fault tolerance, so training can continue when errors occur instead of interrupting the entire training job.

https://pytorch-labs.github.io/torchft/

Recommendations
