During backpropagation, you compute and store a gradient for every parameter, which adds another parameter-sized copy in memory.
B model gradients in FP32 = GB additional.
Total so far for B FP32 full fine-tune:
- Parameters: GB
- Optimizer: GB
- Gradients: GB
- Total: GB (and we haven't counted activations yet)
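The tally above can be sketched as a small helper. This is a hypothetical function, not from the text; it assumes an Adam-style optimizer that keeps two FP32 states (momentum and variance) per parameter, FP32 parameters and gradients at 4 bytes each, and 1 GB = 10^9 bytes:

```python
def full_finetune_memory_gb(num_params: float) -> dict:
    """Rough FP32 full fine-tune memory breakdown, excluding activations."""
    bytes_per_fp32 = 4
    params = num_params * bytes_per_fp32            # model weights
    grads = num_params * bytes_per_fp32             # one gradient per parameter
    optimizer = num_params * 2 * bytes_per_fp32     # Adam: momentum + variance
    gb = 1e9
    return {
        "params_gb": params / gb,
        "grads_gb": grads / gb,
        "optimizer_gb": optimizer / gb,
        "total_gb": (params + grads + optimizer) / gb,
    }

# Example: a 7B-parameter model under these assumptions
breakdown = full_finetune_memory_gb(7e9)
# params 28 GB, grads 28 GB, optimizer 56 GB, total 112 GB
```

Under these assumptions the total comes to 16 bytes per parameter before any activation memory is counted.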