Fine-tuning can improve one thing and break another.
Evaluation checklist:
Target task: did fine-tuning actually improve it (measure on held-out data, not training examples)?
General capabilities: test on broad benchmarks (MMLU, etc.) to catch regressions.
Safety: does the model still refuse harmful requests? Fine-tuning can erode refusal behavior.
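The checklist above can be automated as a before/after comparison: score both the base and fine-tuned model per category, then flag any category that regressed past a tolerance. A minimal sketch; the score values and category names are illustrative stand-ins for real benchmark runs.

```python
# Sketch: flag eval categories where the fine-tuned model regressed
# versus the base model. Scores are hypothetical benchmark accuracies.

REGRESSION_TOLERANCE = 0.02  # allow up to 2 points of drop before flagging

def check_regressions(base_scores, tuned_scores, tolerance=REGRESSION_TOLERANCE):
    """Return {category: delta} for every category that dropped past tolerance."""
    regressions = {}
    for category, base in base_scores.items():
        delta = tuned_scores[category] - base
        if delta < -tolerance:
            regressions[category] = round(delta, 3)
    return regressions

base = {"target_task": 0.61, "mmlu": 0.70, "safety_refusal": 0.99}
tuned = {"target_task": 0.83, "mmlu": 0.64, "safety_refusal": 0.98}

print(check_regressions(base, tuned))  # MMLU dropped 6 points -> flagged
```

Here the target task improved, but the MMLU drop is exactly the catastrophic-forgetting signal the checklist is meant to surface.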
Catastrophic forgetting: the model loses general capabilities while learning the specific task.
Mitigations:
- Lower learning rate
- Fewer epochs
- Mix general data with task data
- Use LoRA or another PEFT method (base weights stay frozen; only small adapters train)
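The "mix general data with task data" mitigation can be sketched as a simple dataset-mixing step: interleave general-domain examples into the task dataset at a fixed ratio so the model keeps seeing broad data during training. The 3:1 ratio below is an assumption to tune, not a recommendation.

```python
# Sketch: mix general-domain examples into task data at a fixed ratio
# (here ~3 task examples per 1 general example; ratio is an assumption).

import random

def mix_datasets(task_data, general_data, task_ratio=3, seed=0):
    """Return a shuffled list with roughly task_ratio task examples per general one."""
    rng = random.Random(seed)
    mixed = list(task_data)
    n_general = max(1, len(task_data) // task_ratio)
    mixed += rng.sample(list(general_data), k=min(n_general, len(general_data)))
    rng.shuffle(mixed)
    return mixed

task = [{"src": "task", "text": f"t{i}"} for i in range(9)]
general = [{"src": "general", "text": f"g{i}"} for i in range(100)]
batch = mix_datasets(task, general)
print(len(batch))  # 9 task + 3 general = 12
```

The replayed general examples give the gradient updates something to preserve, which is the whole point of the mitigation.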
Interview question: "How do you prevent catastrophic forgetting?"
PEFT helps (base model stays frozen). Mix general data into training. Early stopping. Evaluate on diverse benchmarks to catch regressions early.
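The early-stopping piece of that answer is worth keying to a held-out general-capability metric, not just task loss: stop when general performance degrades for several evals in a row. A minimal sketch; the metric history and thresholds are illustrative.

```python
# Sketch: early stopping triggered by consecutive drops in a held-out
# general-capability score (e.g. an MMLU subset run after each epoch).
# `patience` and `min_drop` are hypothetical knobs.

def should_stop(general_scores, patience=2, min_drop=0.01):
    """True if the last `patience` evals each dropped by more than min_drop."""
    if len(general_scores) <= patience:
        return False
    recent = general_scores[-(patience + 1):]
    return all(recent[i] - recent[i + 1] > min_drop for i in range(patience))

history = [0.70, 0.70, 0.68, 0.65]  # general benchmark after each epoch
print(should_stop(history))  # two consecutive drops -> stop
```

Monitoring a general benchmark rather than only the task metric is what turns early stopping into a forgetting mitigation instead of just an overfitting one.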