Standard benchmarks for LLM evaluation:
- MMLU: Massive multitask language understanding (multiple-choice knowledge across many subjects)
- HellaSwag: Commonsense reasoning via sentence completion
- TruthfulQA: Truthfulness and resistance to common misconceptions
- HumanEval: Code generation, scored by functional correctness
- MT-Bench: Multi-turn conversation quality, scored by an LLM judge
Run the relevant benchmarks on the base model before fine-tuning and again on the fine-tuned checkpoint, then compare the scores to confirm the target capability improved without regressing the others. A minimal sketch of such a before/after comparison follows below.
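
The sketch below assumes EleutherAI's lm-evaluation-harness (not named in this document) as the runner; the model paths are hypothetical placeholders and the exact metric keys vary by task, so treat it as a starting point rather than a fixed recipe.

```python
# Sketch: compare benchmark scores before and after fine-tuning using
# lm-evaluation-harness (assumption -- any comparable harness works the same way).
import lm_eval

TASKS = ["mmlu", "hellaswag", "truthfulqa_mc2"]  # task names as registered in the harness

def evaluate(model_path: str) -> dict:
    """Run the selected tasks for one model and return its per-task results."""
    results = lm_eval.simple_evaluate(
        model="hf",                               # HuggingFace transformers backend
        model_args=f"pretrained={model_path}",
        tasks=TASKS,
        batch_size=8,
    )
    return results["results"]

base = evaluate("meta-llama/Llama-2-7b-hf")       # hypothetical base model
tuned = evaluate("./checkpoints/sft-final")       # hypothetical fine-tuned checkpoint

# Report the before/after delta for every numeric metric the two runs share.
for task in TASKS:
    shared_metrics = set(base[task]) & set(tuned[task])
    for metric in sorted(shared_metrics):
        if isinstance(base[task][metric], (int, float)):
            delta = tuned[task][metric] - base[task][metric]
            print(f"{task:15s} {metric:15s} "
                  f"base={base[task][metric]:.3f} tuned={tuned[task][metric]:.3f} "
                  f"delta={delta:+.3f}")
```

The same comparison can be driven from the harness CLI (one run per model, then diff the JSON outputs); the key point is that both runs use identical tasks, few-shot settings, and prompts so the delta reflects the fine-tune rather than the evaluation setup.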