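Checkpoints consume significant storage. Size scales with parameter count: at 2 bytes per parameter (fp16/bf16), the weights alone take roughly 2 GB per billion parameters, and any saved optimizer state adds more.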
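As a rough back-of-the-envelope check, checkpoint size can be estimated from the parameter count and the bytes stored per parameter. The function name `checkpoint_size_gb` and the 1B-parameter example are illustrative assumptions, and the 8 bytes of optimizer state assumes an Adam-style optimizer that keeps two fp32 moments per parameter.

```python
def checkpoint_size_gb(num_params: float,
                       weight_bytes: int = 2,
                       optimizer_bytes: int = 0) -> float:
    """Estimate checkpoint size in GB (1 GB = 1e9 bytes).

    weight_bytes: bytes per parameter for the weights (2 for fp16/bf16, 4 for fp32).
    optimizer_bytes: bytes per parameter of saved optimizer state
        (assumed 8 for Adam-style optimizers with two fp32 moments).
    """
    return num_params * (weight_bytes + optimizer_bytes) / 1e9


# Hypothetical 1B-parameter model, weights only vs. weights plus Adam state.
weights_only = checkpoint_size_gb(1e9)
with_optimizer = checkpoint_size_gb(1e9, weight_bytes=2, optimizer_bytes=8)
print(f"weights only: {weights_only:.1f} GB, with optimizer: {with_optimizer:.1f} GB")
```

Real checkpoints vary with the file format, sharding, and whether fp32 master weights are also stored, so treat the result as an order-of-magnitude estimate.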
Strategies:
- Save every N steps, keep last K
- Save only when validation improves
- Use adapter-only checkpoints with LoRA (typically megabytes rather than gigabytes)
Automate cleanup: without a retention policy, a full training run can accumulate hundreds of gigabytes of checkpoints.
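A minimal sketch of automated cleanup, combining the "keep last K" strategy with protection for a best-validation checkpoint. It assumes checkpoints are saved under names like `checkpoint-<step>`; that naming scheme, the function `prune_checkpoints`, and its parameters are hypothetical and should be adapted to whatever the training framework writes.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme: "checkpoint-<step>", e.g. "checkpoint-1500".
STEP_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def prune_checkpoints(ckpt_dir, keep_last=3, protect=()):
    """Delete all but the newest `keep_last` checkpoints.

    Names listed in `protect` (e.g. the best-validation checkpoint)
    are never deleted. Returns the names removed, oldest first.
    """
    found = []
    for entry in Path(ckpt_dir).iterdir():
        match = STEP_PATTERN.match(entry.name)
        if match:  # ignore logs and other unrelated files
            found.append((int(match.group(1)), entry))
    found.sort(key=lambda item: item[0])  # order by training step

    # Guard keep_last == 0: found[-0:] would select every checkpoint.
    keep = {path.name for _, path in found[-keep_last:]} if keep_last > 0 else set()
    keep |= set(protect)

    removed = []
    for _, path in found:
        if path.name in keep:
            continue
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
        removed.append(path.name)
    return removed
```

Calling `prune_checkpoints("runs/exp1", keep_last=3, protect={"checkpoint-1200"})` after each save would keep the three most recent checkpoints plus the protected one; the path and step number here are placeholders.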