GPUs excel at parallel matrix operations. Training involves massive matrix multiplications across attention and feed-forward layers.
A CPU might have a few dozen cores. A modern GPU has thousands of smaller cores designed for parallel work: an A100 GPU can perform over 300 trillion operations per second (312 TFLOPS) on half-precision math using its tensor cores. This parallelism is what makes training feasible.
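To get a feel for the scale involved, it helps to count the floating-point operations in a single matrix multiply. The sketch below uses the standard FLOP formula for dense matmul; the layer shapes are hypothetical, chosen only to resemble one feed-forward projection in a transformer.

```python
def matmul_flops(n: int, k: int, m: int) -> int:
    # An (n x k) @ (k x m) multiply computes n*m dot products of length k,
    # each costing k multiplies and k adds: 2*n*k*m FLOPs in total.
    return 2 * n * k * m

# Hypothetical feed-forward up-projection: a batch of 2048 tokens,
# hidden size 4096, projected to an inner dimension of 16384.
flops = matmul_flops(2048, 4096, 16384)
print(f"{flops / 1e12:.2f} TFLOPs")  # → 0.27 TFLOPs
```

A single forward pass through one such layer already costs about a quarter of a trillion operations, and training repeats this across many layers, attention heads, and batches, which is why the GPU's massive parallelism matters.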