Data parallelism distributes batches across multiple GPUs. Each GPU holds a complete model copy and processes different data.
After forward and backward passes, gradients are averaged across GPUs (all-reduce). All GPUs update their models identically.
This scales throughput nearly linearly with the number of GPUs (minus communication overhead from the all-reduce), but it doesn't reduce per-GPU memory: each GPU must still hold the full model, its gradients, and its optimizer state.
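The loop above can be sketched with a small CPU-only simulation. This is a minimal illustration, not a real multi-GPU setup: each "GPU" is just a NumPy weight vector, the all-reduce is a plain average, and the linear-regression model, learning rate, and data are hypothetical choices made for the example.

```python
import numpy as np

def local_grad(w, X, y):
    """Gradient of mean squared error on one device's shard of the batch."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

n_gpus = 2
lr = 0.1
# Every simulated "GPU" starts from an identical copy of the model.
replicas = [np.zeros(3) for _ in range(n_gpus)]
# Split the global batch into equal shards, one per device.
shards_X = np.split(X, n_gpus)
shards_y = np.split(y, n_gpus)

for _ in range(500):
    # Forward + backward on each device's own shard.
    grads = [local_grad(w, Xs, ys)
             for w, Xs, ys in zip(replicas, shards_X, shards_y)]
    # All-reduce: average gradients so every replica applies the same update.
    avg = sum(grads) / n_gpus
    replicas = [w - lr * avg for w in replicas]

# Because updates are identical, the replicas never diverge.
assert np.allclose(replicas[0], replicas[1])
```

With equal-sized shards, the averaged shard gradients equal the gradient over the full batch, so the replicas follow exactly the trajectory single-device training would take. In practice a framework utility such as PyTorch's `DistributedDataParallel` performs the gradient all-reduce for you.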