Match metrics to your task:
- Classification: Accuracy, F1, precision, recall
- Generation: BLEU, ROUGE, perplexity
- QA: Exact match, F1 on answers
- Code: Pass@k, execution success rate
- Dialogue: Human preference ratings, coherence scores

Automatic metrics are proxies: they approximate quality and can diverge from it. For generation quality, human evaluation remains the reference standard.
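
As a rough illustration of a few of the metrics listed above, the sketch below computes precision/recall/F1 for binary classification, exact match and token-overlap F1 for QA answers, and the commonly used unbiased pass@k estimate for code. The function names are hypothetical helpers, not a library API, and the answer normalization (lowercasing and whitespace splitting) is a simplifying assumption; real benchmarks often also strip punctuation and articles.

```python
from collections import Counter
from math import comb


def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall, and F1 for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def exact_match(prediction, reference):
    """1.0 if the answers match after simple normalization (an assumption)."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """F1 over overlapping whitespace-separated tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
    print(exact_match(" Paris ", "paris"))
    print(token_f1("the eiffel tower", "eiffel tower"))
    print(pass_at_k(n=10, c=3, k=1))
```

None of these scores says whether an output is actually useful to a reader, which is why they are best paired with human review on a sample of outputs.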