You now understand the memory constraints of fine-tuning and how to work within them.
Key takeaways:
- Training memory = parameters + optimizer state + gradients + activations (activations scale with batch size and sequence length)
- BF16 is the default choice for modern training
- Gradient checkpointing trades compute for memory by recomputing activations during the backward pass
- FSDP and DeepSpeed shard parameters, gradients, and optimizer state across GPUs, enabling training of models that don't fit on one GPU
- Match your hardware to your method and model size
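The memory formula in the first takeaway can be sketched as a back-of-envelope calculator. This is a rough rule of thumb, not an exact accounting: it assumes BF16 weights and gradients plus FP32 Adam state (a master copy of the weights and two moment buffers), and it ignores activations, which depend on batch size and sequence length.

```python
def training_memory_gb(n_params: float,
                       weight_bytes: int = 2,    # BF16 weights
                       grad_bytes: int = 2,      # BF16 gradients
                       optim_bytes: int = 12):   # Adam: FP32 master weights + 2 FP32 moments
    """Rough training-memory estimate in GB, excluding activations.

    Assumes mixed-precision training with Adam; adjust the byte counts
    for other optimizers or precisions.
    """
    total_bytes = n_params * (weight_bytes + grad_bytes + optim_bytes)
    return total_bytes / 1e9

# A 7B-parameter model needs roughly 112 GB before activations --
# far beyond a single 80 GB GPU, which is why sharding matters.
print(f"{training_memory_gb(7e9):.0f} GB")
```

At 16 bytes per parameter, even the static state of a 7B model exceeds any single consumer or datacenter GPU, which motivates the FSDP/DeepSpeed takeaway above.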
Next, I'll teach you how to prepare data for fine-tuning.