Layer normalization stabilizes training by normalizing activations across the feature dimension.
For each position (e.g., each token in a sequence), it computes the mean and variance across that position's features, then normalizes:
LayerNorm(x) = gamma * (x - mean) / sqrt(variance + epsilon) + beta
Gamma (a per-feature scale) and beta (a per-feature shift) are learned parameters; epsilon is a small constant for numerical stability. Keeping activations in a consistent range this way prevents them from exploding or vanishing as signals pass through many layers.
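The formula above can be sketched directly in NumPy; this is a minimal illustration, not a framework implementation (function and variable names are chosen for clarity), with the reduction taken over the last axis as the feature dimension:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # One mean and variance per position, computed over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # Normalize, then apply the learned per-feature scale and shift.
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Example: 2 positions, 4 features each.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
gamma = np.ones(4)   # learned scale, typically initialized to 1
beta = np.zeros(4)   # learned shift, typically initialized to 0
y = layer_norm(x, gamma, beta)
# Each row of y now has mean ~0 and variance ~1, regardless of the
# original scale of that row's activations.
```

Note that both input rows normalize to nearly identical outputs even though their raw magnitudes differ by 10x, which is exactly the scale invariance that stabilizes training.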