Checkpoints consume significant storage: a full checkpoint of a 7B-parameter model is ~14GB in fp16/bf16 (2 bytes per parameter), before optimizer state.
Strategies:
- Save every N steps, keep last K
- Save only when validation improves
- Use adapter-only checkpoints with LoRA (~100MB)
Automate cleanup: without it, a full training run can accumulate hundreds of gigabytes of stale checkpoints.
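The save-every-N, keep-last-K strategy can be sketched with a small rotation helper; this is a minimal stdlib-only illustration (the `save_checkpoint` name, file layout, and raw-bytes write standing in for `torch.save`/safetensors are all assumptions, not a specific framework's API):

```python
import os
import re

def save_checkpoint(ckpt_dir, step, data, keep_last=3):
    """Write checkpoint-<step>.bin and prune older files, keeping the newest `keep_last`."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"checkpoint-{step}.bin")
    with open(path, "wb") as f:
        f.write(data)  # stand-in for the real serialization call

    # List existing checkpoints sorted by step number, then delete the oldest.
    pattern = re.compile(r"checkpoint-(\d+)\.bin$")
    ckpts = sorted(
        (int(m.group(1)), name)
        for name in os.listdir(ckpt_dir)
        if (m := pattern.match(name))
    )
    for _, name in ckpts[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, name))
```

Calling this inside the training loop every N steps keeps disk usage bounded at roughly K times the checkpoint size. The save-on-validation-improvement strategy is the same idea with the save call gated on a best-metric comparison.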