Total training memory roughly equals:
Memory = Parameters + Optimizer States + Gradients + Activations
For FP32 full fine-tuning:
- Parameters: 4 bytes × param count (one FP32 copy of the weights)
- Optimizer (AdamW): 8 bytes × param count (two FP32 states per parameter: first and second moments)
- Gradients: 4 bytes × param count (one FP32 value per parameter)
- Activations: varies with batch size, sequence length, and architecture; often the dominant term at long sequence lengths
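As a rough sketch, the static components above (parameters, gradients, and AdamW states, ignoring activations) can be estimated in a few lines of Python; the function name and the 7B-parameter example are illustrative, not from any particular library:

```python
def fp32_training_memory_gib(param_count: float) -> float:
    """Estimate static FP32 full fine-tuning memory in GiB (activations excluded)."""
    bytes_per_param = 4          # FP32 = 4 bytes
    params = bytes_per_param * param_count       # model weights
    grads = bytes_per_param * param_count        # one gradient per parameter
    optimizer = 2 * bytes_per_param * param_count  # AdamW: first + second moments
    total_bytes = params + grads + optimizer     # 16 bytes per parameter
    return total_bytes / 1024**3

# A hypothetical 7B-parameter model needs roughly 104 GiB before activations
print(round(fp32_training_memory_gib(7e9), 1))
```

At 16 bytes per parameter, even a mid-sized model exceeds a single GPU's memory before any activations are counted, which is why the reductions below matter.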
Mixed-precision training and parameter-efficient fine-tuning (PEFT) methods such as LoRA reduce these numbers dramatically: PEFT keeps the base weights frozen, so gradients and optimizer states are needed only for the small set of trainable parameters.
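To illustrate the PEFT savings, here is a hedged sketch assuming a LoRA-style setup where only a small fraction of parameters is trainable; the function name and the 0.5% trainable fraction are illustrative assumptions:

```python
def peft_memory_gib(param_count: float, trainable_count: float) -> float:
    """Estimate static memory in GiB for PEFT (activations excluded)."""
    # Frozen base weights: FP32 copy only, no gradients or optimizer states
    base_bytes = 4 * param_count
    # Trainable adapter params: weights + gradients + AdamW moments, all FP32
    trainable_bytes = (4 + 4 + 2 * 4) * trainable_count
    return (base_bytes + trainable_bytes) / 1024**3

# Hypothetical 7B base model with ~0.5% trainable adapter parameters
print(round(peft_memory_gib(7e9, 35e6), 1))
```

Against the ~104 GiB of FP32 full fine-tuning, the same 7B model drops to under 30 GiB of static memory, because the 12 bytes of gradient and optimizer overhead apply only to the adapter parameters.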