DeepSpeed ZeRO offers three levels (stages) of sharding:
- ZeRO-1: Shards optimizer states only. Moderate savings.
- ZeRO-2: Shards optimizer states and gradients. Good balance.
- ZeRO-3: Shards everything, including parameters. Maximum memory efficiency.
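
To make the savings concrete, here is a rough sketch of per-GPU memory for model states at each stage. It assumes mixed-precision training with Adam, where each parameter costs about 2 bytes (fp16 weight) + 2 bytes (fp16 gradient) + 12 bytes (fp32 master weight, momentum, variance). The function name `zero_bytes_per_gpu` is hypothetical, and the estimate ignores activations, buffers, and fragmentation, so treat the numbers as a lower bound rather than a prediction.

```python
def zero_bytes_per_gpu(num_params: int, num_gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU (assumes fp16 training with Adam)."""
    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 master weights + Adam momentum + variance

    if stage >= 1:            # stage 1 shards optimizer states
        optim /= num_gpus
    if stage >= 2:            # stage 2 also shards gradients
        grads /= num_gpus
    if stage >= 3:            # stage 3 also shards parameters
        params /= num_gpus
    return params + grads + optim


# Example: a 1B-parameter model on 8 GPUs (stage 0 means no sharding).
for stage in range(4):
    gib = zero_bytes_per_gpu(1_000_000_000, 8, stage) / 2**30
    print(f"stage {stage}: {gib:.1f} GiB per GPU")
```

Under these assumptions the optimizer states dominate, which is why even ZeRO-1 gives a noticeable reduction.
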
With CPU offload, ZeRO can train multi-billion-parameter models on a single GPU, though slowly. Choose the lowest stage that fits your memory constraints, since higher stages add communication overhead.
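
As a minimal sketch of how the stage and offload choices might be expressed, the snippet below builds a DeepSpeed-style config dictionary. The helper `make_zero_config` is hypothetical, and the batch size and `fp16` settings are placeholder assumptions; the `zero_optimization`, `stage`, `offload_optimizer`, and `offload_param` keys follow the DeepSpeed JSON config schema, but check them against the documentation for your installed version.

```python
import json


def make_zero_config(stage: int, cpu_offload: bool = False) -> dict:
    """Build a minimal DeepSpeed-style config (hypothetical helper)."""
    if stage not in (1, 2, 3):
        raise ValueError("ZeRO stage must be 1, 2, or 3")

    zero = {"stage": stage}
    if cpu_offload:
        # Move optimizer states to host memory.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        if stage == 3:
            # Parameter offload only applies to stage 3.
            zero["offload_param"] = {"device": "cpu", "pin_memory": True}

    return {
        "train_micro_batch_size_per_gpu": 1,  # placeholder value
        "fp16": {"enabled": True},            # assumes mixed-precision training
        "zero_optimization": zero,
    }


# Print a stage 3 config with CPU offload, ready to save as JSON.
print(json.dumps(make_zero_config(3, cpu_offload=True), indent=2))
```

The resulting dictionary can be written to a JSON file and passed to the DeepSpeed launcher, or handed directly to the training setup as a Python dict.
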