The full attention formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
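The formula translates almost line for line into NumPy. This is a minimal sketch for a single head with unbatched inputs and no masking; all names and shapes here are illustrative, not from the original text:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) similarity scores
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # weighted average of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))  # 2 queries,  d_k = 4
K = rng.normal(size=(3, 4))  # 3 keys,     d_k = 4
V = rng.normal(size=(3, 5))  # 3 values,   d_v = 5
out = attention(Q, K, V)     # shape (2, 5): one d_v-dim output per query
```

Note that the output dimension comes from V (d_v), while sqrt(d_k) scales by the query/key dimension; the two need not match.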
The scaling factor sqrt(d_k) matters. Without it, the variance of the dot products grows linearly with the key dimension d_k, so for large d_k the softmax saturates: one entry approaches 1, the rest approach 0, and the gradients through it become vanishingly small. Dividing by sqrt(d_k) keeps the scores at roughly unit variance (assuming unit-variance inputs), which keeps gradients healthy during training.
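The saturation effect is easy to demonstrate numerically. The sketch below (illustrative, not from the original text) compares the largest softmax weight with and without scaling for a large d_k; unscaled scores have a much wider spread, so one key tends to dominate:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # stable softmax over a 1-D score vector
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 1024
q = rng.normal(size=d_k)          # one query
K = rng.normal(size=(16, d_k))    # 16 random keys

raw = K @ q                       # dot products: std grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)       # scaled scores: std stays near 1

# Unscaled scores typically give one weight close to 1 (a saturated,
# near-one-hot distribution); scaled scores stay comparatively soft.
print("unscaled max weight:", softmax(raw).max())
print("scaled max weight:  ", softmax(scaled).max())
```

Multiplying the logits by a constant greater than 1 can only sharpen the softmax, which is exactly why the unscaled distribution saturates as d_k grows.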
This single operation is the heart of transformers. Everything else is built around it.