After computing attention separately in each head, the results are concatenated and projected back to the model dimension.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
The output projection W_O, a matrix of shape (h · d_v) × d_model, learns to combine information from all heads. This final linear layer mixes the different attention patterns into a unified representation.
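The concat-and-project step can be sketched in NumPy as follows. The dimensions (h = 4 heads, d_v = 16 per head, d_model = 64, sequence length 10) and the random per-head outputs are illustrative assumptions, not values from the text:

```python
import numpy as np

# Assumed illustrative dimensions: 4 heads, 16 dims per head,
# model dimension 64, sequence length 10.
h, d_v, d_model, seq_len = 4, 16, 64, 10
rng = np.random.default_rng(0)

# Stand-ins for the per-head attention outputs: h arrays of shape (seq_len, d_v).
heads = [rng.standard_normal((seq_len, d_v)) for _ in range(h)]

# Output projection W_O maps the concatenated heads (h * d_v) back to d_model.
W_O = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v)

concat = np.concatenate(heads, axis=-1)  # shape (seq_len, h * d_v)
output = concat @ W_O                    # shape (seq_len, d_model)
print(output.shape)  # (10, 64)
```

Note that concatenation alone keeps each head's output in its own slice of the feature axis; it is the multiplication by W_O that lets every output dimension draw on all heads at once.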