Define success before you start fine-tuning.
Good metrics:
- Task-specific accuracy (classification, extraction)
- Human preference ratings (A/B tests)
- Response quality scores (LLM-as-judge)
- Business metrics (resolution rate, user satisfaction)
Bad metrics:
- Training loss alone (a low loss doesn't guarantee good outputs on real inputs)
- General benchmarks (they rarely reflect your specific task)
If you can't measure improvement, you can't know if fine-tuning worked.
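
As a minimal sketch of what "measuring improvement" can look like for a classification task, the snippet below scores a baseline and a fine-tuned model on the same held-out examples and reports the difference. Everything named here is an assumption for illustration: `EVAL_SET`, `baseline_model`, and `fine_tuned_model` are hypothetical stand-ins, and exact-match accuracy is only one possible choice of task-specific metric.

```python
from typing import Callable, Sequence


def accuracy(
    predict: Callable[[str], str],
    examples: Sequence[tuple[str, str]],
) -> float:
    """Fraction of examples where the prediction matches the expected label."""
    if not examples:
        raise ValueError("evaluation set is empty")
    correct = sum(
        1
        for prompt, expected in examples
        if predict(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)


# Hypothetical held-out set: (input, expected label). These examples must not
# appear in the fine-tuning data, or the comparison is meaningless.
EVAL_SET = [
    ("Refund not received after 10 days", "billing"),
    ("App crashes when I open settings", "bug"),
    ("How do I change my email address?", "account"),
    ("I was charged twice this month", "billing"),
]


def baseline_model(prompt: str) -> str:
    # Stand-in for the base model (assumption): always predicts one label.
    return "billing"


def fine_tuned_model(prompt: str) -> str:
    # Stand-in for the fine-tuned model (assumption): simple keyword rules.
    text = prompt.lower()
    if "crash" in text:
        return "bug"
    if "email" in text:
        return "account"
    return "billing"


# Score both models on the same examples so the numbers are comparable.
baseline_acc = accuracy(baseline_model, EVAL_SET)
tuned_acc = accuracy(fine_tuned_model, EVAL_SET)

print(f"baseline:   {baseline_acc:.2f}")
print(f"fine-tuned: {tuned_acc:.2f}")
print(f"delta:      {tuned_acc - baseline_acc:+.2f}")
```

In practice the two stand-in functions would be replaced by calls to the real base and fine-tuned models. The part worth keeping is the shape: one fixed evaluation set, one metric chosen up front, and a baseline number recorded before any training run.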