Residual connections add the input back to the output of each sublayer:
output = LayerNorm(x + Sublayer(x))
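This post-norm form (normalize after adding the skip connection) can be sketched in a few lines of numpy. The function names and the toy linear sublayer here are illustrative, and the layer norm omits the learned scale and shift parameters for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Minimal layer norm: normalize across the feature dimension.
    # (Real implementations also apply learned gain and bias terms.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    # output = LayerNorm(x + Sublayer(x)) -- the formula above.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1   # toy linear sublayer weights
x = rng.normal(size=(4, 8))

out = residual_block(x, lambda h: h @ W)
print(out.shape)
```

Note that the input is added back *before* normalization; the sublayer only has to learn a correction to `x`, not a full replacement for it.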
This helps gradients flow through deep networks. Without residuals, the backward pass multiplies the Jacobian of every transformation in the stack, and that product can shrink toward zero as depth grows. With residuals, each layer computes x + Sublayer(x), so each layer's Jacobian contains an identity term and gradients have a direct path back to earlier layers.
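A small numerical sketch makes the gradient argument concrete. Under the simplifying assumption that each sublayer is a plain linear map h @ W with small random weights, we can compare the end-to-end Jacobian of a 20-layer stack with and without skip connections (the setup and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
depth, d = 20, 16
Ws = [rng.normal(size=(d, d)) * 0.05 for _ in range(depth)]  # small weights

def jacobian_product(residual):
    # End-to-end Jacobian of a stack of linear sublayers h @ W.
    # Without residuals each per-layer factor is W; with residuals it is
    # I + W, so the product always contains an identity path to the input.
    J = np.eye(d)
    for W in Ws:
        factor = np.eye(d) + W if residual else W
        J = factor @ J
    return J

plain = np.linalg.norm(jacobian_product(residual=False))
skip = np.linalg.norm(jacobian_product(residual=True))
print(plain)  # vanishingly small: 20 multiplications by small W
print(skip)   # order one: the identity path keeps gradients alive
```

The plain stack's gradient norm collapses after 20 multiplications, while the residual stack's stays near the scale of the identity, which is exactly why the skip path matters at depth.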
Residual connections make training deep transformers possible. They're why we can stack dozens of layers without gradients vanishing.