Match metrics to your task:
- Classification: Accuracy, F1, precision, recall
- Generation: BLEU, ROUGE, perplexity
- QA: Exact match, token-level F1 on answers
- Code: Pass@k, execution success rate
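- Dialogue: Human preference ratings, coherence scores

Automatic metrics are proxies: they are cheap and repeatable, but they only approximate the quality you actually care about. For generation quality, human evaluation is the closest thing to ground truth.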
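Several of the metrics listed above are short enough to sketch directly. The following is a minimal illustration, not a reference implementation: the function names (`normalize`, `exact_match`, `token_f1`, `pass_at_k`) are hypothetical, the text normalization is a simplified assumption (lowercasing and punctuation stripping only), and `pass_at_k` uses the commonly cited unbiased estimator `1 - C(n-c, k) / C(n, k)` for `n` samples of which `c` pass.

```python
import re
from collections import Counter
from math import comb


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace.

    Assumption: real QA benchmarks often apply extra steps (e.g. dropping
    articles); this sketch deliberately keeps normalization simple.
    """
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which pass the tests."""
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(exact_match("Paris.", "paris"))                # True
    print(round(pass_at_k(n=10, c=3, k=1), 2))           # 0.3
    # A reasonable paraphrase still scores poorly on token overlap.
    print(round(token_f1("A feline rested on the rug",
                         "The cat sat on the mat"), 2))  # 0.33
```

The last call shows why overlap-based scores are only proxies: the paraphrase conveys roughly the same meaning as the reference, yet shares just two tokens with it and scores about 0.33.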