Attention heads specialize:
- Some heads track syntactic relationships (e.g., subject-verb agreement)
- Some heads handle coreference (resolving what pronouns refer to)
- Some heads capture positional patterns (e.g., attending to the previous token)
- Some heads seem to store factual knowledge
This specialization emerges from training, not from design. You don't program it; the model discovers useful attention patterns on its own.
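Because attention weights are explicit, you can probe for some of these patterns directly. Here is a minimal sketch, assuming PyTorch and the Hugging Face transformers library, that scores every head in GPT-2 small by how much attention it pays to the immediately preceding token, a crude test for the positional heads listed above. The model choice and the 0.5 threshold are arbitrary illustrations, not anything prescribed by the text.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative choices: GPT-2 small (12 layers x 12 heads) and a 0.5 cutoff.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

inputs = tokenizer("The keys to the cabinet are on the table.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each shaped
# (batch, num_heads, seq_len, seq_len); row i is token i's attention
# distribution over tokens 0..i (GPT-2 attention is causal).
for layer_idx, layer_attn in enumerate(outputs.attentions):
    attn = layer_attn[0]  # drop the batch dim -> (num_heads, seq_len, seq_len)
    # attn[h, i + 1, i] is head h's attention from token i+1 back to token i;
    # the sub-diagonal collects these for every position, averaged per head.
    prev_token_mass = attn.diagonal(offset=-1, dim1=-2, dim2=-1).mean(dim=-1)
    for head_idx, mass in enumerate(prev_token_mass.tolist()):
        if mass > 0.5:
            print(f"layer {layer_idx} head {head_idx}: "
                  f"{mass:.2f} attention on the previous token")
```

Analogous probes work for the other roles, though they need targeted inputs: compare attention from a pronoun to candidate antecedents for coreference, or from a verb to its subject for agreement.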