Gradient checkpointing trades compute for memory. Instead of storing all activations from the forward pass, it saves checkpoints at intervals and recomputes the intermediate activations during the backward pass.
With checkpoints placed every √n layers, activation memory drops from O(n) to O(√n), at the cost of roughly one extra forward pass (typically ~25–30% slower training).
Enable it when you're memory-constrained. The extra compute is usually worth it if it lets you fit larger batches or models.
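The recompute-on-backward idea can be sketched in plain Python on a chain of scalar operations. This is a hypothetical illustration, not any framework's API (in PyTorch, for example, `torch.utils.checkpoint` handles this automatically); the function and variable names are invented for the example.

```python
def checkpointed_forward(fns, x, every=2):
    """Run fns in sequence, storing only every `every`-th activation."""
    checkpoints = {0: x}
    a = x
    for i, f in enumerate(fns, start=1):
        a = f(a)
        if i % every == 0:
            checkpoints[i] = a  # save a checkpoint, discard the rest
    return a, checkpoints

def recompute(fns, checkpoints, i, every=2):
    """Recover activation i by re-running from the nearest earlier checkpoint."""
    base = (i // every) * every
    a = checkpoints[base]
    for j in range(base, i):
        a = fns[j](a)
    return a

def backward(fns, derivs, checkpoints, every=2):
    """Chain-rule backward pass: inputs to each op are recomputed, not stored."""
    grad = 1.0
    for i in range(len(fns), 0, -1):
        a_in = recompute(fns, checkpoints, i - 1, every)
        grad *= derivs[i - 1](a_in)
    return grad

# Example chain: out = (2x + 3)^2 - 1, so d(out)/dx = 4(2x + 3)
fns    = [lambda x: 2 * x, lambda x: x + 3, lambda x: x * x, lambda x: x - 1]
derivs = [lambda x: 2.0,   lambda x: 1.0,   lambda x: 2 * x, lambda x: 1.0]

out, cps = checkpointed_forward(fns, 5.0, every=2)
grad = backward(fns, derivs, cps, every=2)
# out = 168.0, grad = 52.0, with only ~half the activations ever stored
```

The trade-off is visible in `recompute`: every activation between two checkpoints is re-derived on demand during the backward pass, which is exactly the extra forward compute the technique pays for the memory savings.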