Gradient accumulation simulates larger batch sizes without the memory cost. Instead of stepping the optimizer after every batch, you run a backward pass on each of several micro-batches, letting their gradients accumulate, and apply a single optimizer update afterwards.
With k accumulation steps and a micro-batch size of b, the effective batch size is k × b.
This lets you train with stable, large effective batch sizes on limited VRAM. If each micro-batch loss is scaled by 1/k before backward, the accumulated gradient equals the mean gradient of the full effective batch, so the update closely matches a true large batch. One caveat: layers that compute per-batch statistics, such as batch normalization, still see only the micro-batch, so their behavior is not identical.
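The equivalence can be checked numerically. A minimal sketch using NumPy and a linear least-squares model (the model, data shapes, and variable names here are illustrative, not from any particular framework):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full batch: 8 samples, 3 features
y = rng.normal(size=8)
w = np.zeros(3)               # linear model parameters

def grad(Xb, yb, w):
    # Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2) w.r.t. w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# One-shot gradient over the full batch
full = grad(X, y, w)

# Same gradient via accumulation: 4 micro-batches of 2 samples each.
# Scaling each micro-batch gradient by 1/accum_steps turns the
# accumulated sum into the mean over the full effective batch.
accum_steps = 4
acc = np.zeros_like(w)
for i in range(accum_steps):
    Xb = X[i * 2 : (i + 1) * 2]
    yb = y[i * 2 : (i + 1) * 2]
    acc += grad(Xb, yb, w) / accum_steps

assert np.allclose(full, acc)
```

The 1/k scaling only yields an exact mean when the micro-batches are equally sized; a ragged final micro-batch needs its own weighting.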