Fully Sharded Data Parallel (FSDP) shards the model across GPUs: each GPU permanently holds only a fraction of the parameters, gradients, and optimizer states.
During the forward and backward passes, the full parameters of a layer are gathered on demand and freed immediately after use, and gradients are reduce-scattered back to their owning shards. This dramatically reduces steady-state per-GPU memory.
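The gather-then-free cycle above can be illustrated with a toy sketch in plain Python (not real PyTorch code; the ranks, shards, and `forward_on_rank` helper here are hypothetical stand-ins): each rank stores only its 1/world_size slice of a layer's "weights", and the full parameter list exists only briefly around the layer's compute.

```python
# Toy illustration of FSDP's memory mechanics (no real GPUs or collectives).
world_size = 4
full_params = list(range(8))  # pretend these are one layer's weights

# Shard: each rank keeps a contiguous 1/world_size slice as its resident state.
shard_size = len(full_params) // world_size
shards = [full_params[r * shard_size:(r + 1) * shard_size]
          for r in range(world_size)]

def forward_on_rank(rank):
    # "All-gather": briefly reconstruct the full parameter from every shard...
    gathered = [p for shard in shards for p in shard]
    out = sum(gathered)  # ...run the layer's compute on the full copy...
    del gathered         # ...then free the full copy immediately.
    return out

# Steady-state memory per rank is 1/world_size of the layer.
assert all(len(s) == shard_size for s in shards)
print(forward_on_rank(0))
```

In real FSDP the gather is an NCCL all-gather and the free happens as soon as the layer's computation completes, so at most a few layers' worth of full parameters are ever materialized at once.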
FSDP thus enables training models that don't fit on a single GPU: a model whose parameters, gradients, and optimizer states exceed one device's memory can still train once that state is split across the group.