A single attention head computes one set of attention weights per query, so it tends to capture one kind of relationship at a time. But language involves many kinds of relationships at once: syntactic, semantic, positional.
Multi-head attention runs several attention operations in parallel, each with its own learned query, key, and value projections. One head might learn to track subject-verb relationships; another might track coreference. Together they capture richer patterns than any single head could.
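The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random stand-ins for learned weights, there is no masking, batching, or dropout, and the function and variable names (`multi_head_attention`, `Wq`, `Wk`, `Wv`, `Wo`) are chosen for this example rather than taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project, then split the model dimension into per-head subspaces:
    # each head effectively gets its own learned Q, K, V projection.
    Q = (x @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (x @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (x @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    heads = weights @ V                                  # (heads, seq, d_head)
    # Concatenate the heads back together and mix them with Wo.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 64, 8, 10
Wq, Wk, Wv, Wo = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
x = rng.standard_normal((seq_len, d_model))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
assert out.shape == (seq_len, d_model)
```

Because each head attends within its own `d_head`-dimensional subspace, the total cost is comparable to one full-width attention operation, while the output projection `Wo` lets the model combine what the different heads found.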