Scale training across GPUs with DeepSpeed or Accelerate.
DeepSpeed's ZeRO stages:
- ZeRO-1: Shard optimizer states
- ZeRO-2: Also shard gradients
- ZeRO-3: Also shard parameters
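As a rough illustration, a ZeRO stage 2 setup might look like the sketch below. The config keys mirror DeepSpeed's documented JSON schema, but the model, batch size, and learning rate are placeholder assumptions, not a recommended recipe:

```python
import torch
import deepspeed

# Illustrative ZeRO stage 2 config: shard optimizer states and gradients.
# Values here are placeholders; tune them for your model and hardware.
ds_config = {
    "train_batch_size": 32,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

model = torch.nn.Linear(512, 10)  # toy model standing in for a real network

# deepspeed.initialize wraps the model in an engine and builds the
# sharded optimizer described by the config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

A script like this is typically started with the `deepspeed` launcher (e.g. `deepspeed train.py`), which spawns one process per GPU.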
Accelerate is simpler: you write the training loop as if for a single GPU, and Accelerate handles device placement and distributed execution. Run `accelerate config` once to describe your hardware, then launch with `accelerate launch train.py`.
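A minimal sketch of such a script, assuming a toy model and synthetic data as placeholders for a real training setup:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up devices chosen via `accelerate config`

# Placeholder model and synthetic data.
model = torch.nn.Linear(512, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
dataloader = DataLoader(dataset, batch_size=32)

# prepare() moves everything to the right device(s) and shards the
# dataloader so each process sees its own slice of the data.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

loss_fn = torch.nn.CrossEntropyLoss()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward(); handles scaling
    optimizer.step()
```

The same script runs unchanged on a laptop CPU, one GPU, or several, with `accelerate launch` deciding how many processes to spawn.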
DeepSpeed is the stronger choice when a model is too large to replicate on every GPU, since ZeRO shards the memory footprint across workers. Accelerate is easier for straightforward data-parallel multi-GPU training. The two are not mutually exclusive: Accelerate can use DeepSpeed as a backend, and major training libraries such as Hugging Face Transformers integrate with both.