Training loss isn't enough. Evaluate your fine-tuned model properly:
- Hold-out test set (never seen during training)
- Task-specific metrics (accuracy, F1, BLEU, etc.)
- Human evaluation for subjective quality
- Regression testing on general capabilities
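
As a rough illustration of the first two checks, here is a minimal sketch in plain Python. The helper names (`split_holdout`, `accuracy`, `f1_binary`) and the 20% test fraction are assumptions made for this example, not part of any particular library; in practice you would likely use your framework's own splitting and metric utilities.

```python
import random

def split_holdout(examples, test_fraction=0.2, seed=0):
    """Shuffle deterministically and carve off a test set that training never sees.

    test_fraction=0.2 is an arbitrary illustrative default.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_binary(y_true, y_pred, positive=1):
    """F1 for a single positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Fixing the seed keeps the split reproducible, so every model you evaluate later is scored on exactly the same held-out examples.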
Compare against the base model and your prompting baseline on the same test set. Fine-tuning should beat both; if it doesn't, it hasn't justified its cost.
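
One way to make that comparison concrete is sketched below. It assumes each system is wrapped as a hypothetical `predict(text) -> label` callable, that accuracy is a suitable metric for both the task set and the general-capability set, and that a drop of more than 0.02 on the general set counts as a regression. All of these are illustrative choices, not fixed rules.

```python
def evaluate(predict, test_set):
    """Accuracy of a predict(text) -> label callable on (text, label) pairs."""
    correct = sum(predict(text) == label for text, label in test_set)
    return correct / len(test_set)

def compare_systems(systems, task_test, general_test, max_regression=0.02):
    """Score every system on the same two test sets and summarize the outcome.

    systems: dict with the keys "base", "prompted", and "fine_tuned"
             (hypothetical names), each mapping to a predict callable.
    max_regression: tolerated drop in general-capability accuracy relative
                    to the base model (0.02 is an arbitrary assumption).
    """
    task = {name: evaluate(fn, task_test) for name, fn in systems.items()}
    general = {name: evaluate(fn, general_test) for name, fn in systems.items()}
    return {
        "task": task,
        "general": general,
        # Fine-tuning must beat both baselines on the target task.
        "beats_baselines": task["fine_tuned"] > max(task["base"], task["prompted"]),
        # Flag a loss of general capability compared with the base model.
        "regressed": general["base"] - general["fine_tuned"] > max_regression,
    }
```

A fine-tuned model that sets `beats_baselines` to true but also sets `regressed` to true has traded general capability for task performance, which is worth knowing before you ship it.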