Activation Checkpointing

Creator
Seonglae Cho
Created
2024 May 6 10:49
Edited
2024 May 6 11:46
Refs

Gradient checkpointing

A technique that reduces memory usage by discarding the activations of selected layers during the forward pass and recomputing them during the backward pass. Rather than storing every activation in the network, only a subset is kept; the discarded activations are restored by recomputation when gradients are propagated.
When a module is designated as a checkpoint, only that module's inputs and outputs remain in memory at the end of the forward pass; all intermediate tensors produced inside the module are freed as the forward pass proceeds. Those tensors are recomputed during the backward pass when the checkpointed module is reached. By that point, the layers after the checkpoint have already completed their backward pass, so the peak memory usage is reduced.
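The recompute-on-backward idea above can be sketched in plain Python. This is an illustrative toy, not a real autodiff implementation: a chain of identical "layers" is run forward while keeping only the activations at checkpoint boundaries, and the intermediates inside a segment are rebuilt from the saved boundary when the backward pass would need them. The function names (`forward_checkpointed`, `recompute_segment`) and the scalar layer are invented for this sketch.

```python
def layer(x):
    # Toy "layer": a simple scalar function standing in for a network layer
    return 2 * x + 1

layers = [layer, layer, layer, layer]

def forward_checkpointed(x, layers, ckpt_every=2):
    """Run the chain, saving only activations at checkpoint boundaries."""
    saved = [x]                      # inputs to each checkpointed segment
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % ckpt_every == 0 and i + 1 < len(layers):
            saved.append(x)          # keep this boundary activation only
    return x, saved                  # intermediates inside segments were dropped

def recompute_segment(x, segment):
    """During backward, rebuild the intermediates of one segment."""
    acts = [x]
    for f in segment:
        x = f(x)
        acts.append(x)
    return acts

out, saved = forward_checkpointed(3, layers)   # saved holds boundaries, not all activations
# Backward over the last segment: recompute its intermediates from the
# last saved boundary instead of having stored them during the forward pass.
acts = recompute_segment(saved[-1], layers[2:])
assert acts[-1] == out               # recomputation reproduces the forward result
```

In PyTorch the same pattern is provided by `torch.utils.checkpoint.checkpoint`, which wraps a module or function so that its internal activations are freed during forward and recomputed during backward.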
