GPUs excel at parallel matrix operations. Training involves massive matrix multiplications across attention and feed-forward layers.
A CPU might have 8–16 cores; a modern GPU has thousands of smaller cores designed for parallel work. An NVIDIA A100, for example, peaks at over 300 trillion half-precision floating-point operations per second on its tensor cores. This parallelism is what makes training feasible.
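A back-of-envelope calculation shows why that throughput matters. The sketch below counts the floating-point operations in a single feed-forward matrix multiply and estimates its runtime at peak A100 half-precision throughput; the matrix shapes are illustrative assumptions, not figures from the text.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    # An (m x k) @ (k x n) matmul costs ~2*m*k*n FLOPs:
    # one multiply and one add per accumulated term.
    return 2 * m * k * n

# Hypothetical example: 2048 tokens through a 4096 -> 16384
# feed-forward projection (shapes chosen for illustration).
flops = matmul_flops(2048, 4096, 16384)

A100_FP16_FLOPS = 312e12  # peak dense FP16 tensor-core throughput

print(f"{flops / 1e12:.2f} TFLOPs for this one matmul")
print(f"~{flops / A100_FP16_FLOPS * 1e6:.0f} microseconds at peak")
```

Real kernels never hit peak throughput (memory bandwidth and launch overheads intervene), but the estimate shows the scale: a matmul with hundreds of billions of operations finishes in under a millisecond, and training repeats such multiplies billions of times.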