Practical guidance for DPO:
- Start with a good SFT model. DPO refines an already-capable model; it doesn't fix a broken one.
- Use learning rates around 5e-7 to 1e-6, roughly an order of magnitude lower than typical SFT rates (see the config sketch after this list).
- Train for 1-3 epochs at most. Overfitting is easy with preference data.
- Monitor the log probabilities of both chosen and rejected responses. In practice both often drift downward, with rejected falling faster; a collapsing chosen log probability means the model is unlearning good responses (see the monitoring sketch below).
- Evaluate preference accuracy on held-out pairs, not just training loss (see the evaluation sketch below).
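
As a concrete starting point, the learning-rate and epoch guidance above might translate into a trainer configuration like this. This is a minimal sketch assuming the Hugging Face TRL library's DPOTrainer; exact parameter names have shifted across TRL versions, and the model name and dataset here are placeholders, not real artifacts.

```python
# Minimal DPO training setup, assuming TRL's DPOTrainer API.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "my-org/my-sft-model"  # hypothetical SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", "rejected" columns (assumed schema).
train_dataset = load_dataset("my-org/my-preferences", split="train")

config = DPOConfig(
    output_dir="dpo-output",
    learning_rate=5e-7,             # well below typical SFT rates
    num_train_epochs=1,             # preference data overfits quickly
    beta=0.1,                       # KL-penalty strength; a common default
    per_device_train_batch_size=4,
)

trainer = DPOTrainer(
    model=model,                    # ref_model omitted: TRL clones one by default
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # older TRL versions use tokenizer= instead
)
trainer.train()
```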
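
To make the monitoring point concrete, here is the DPO loss and the quantities worth logging in plain PyTorch. The sequence log probabilities are assumed to be pre-computed sums over response tokens; all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss_and_metrics(
    policy_chosen_logps: torch.Tensor,    # sum of log p(token) over chosen response
    policy_rejected_logps: torch.Tensor,  # same, for rejected response
    ref_chosen_logps: torch.Tensor,       # frozen reference model, chosen
    ref_rejected_logps: torch.Tensor,     # frozen reference model, rejected
    beta: float = 0.1,
):
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO objective: logistic loss on the reward margin.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    metrics = {
        # Watch both: rejected logps should fall faster than chosen ones.
        "logps/chosen": policy_chosen_logps.mean().item(),
        "logps/rejected": policy_rejected_logps.mean().item(),
        "rewards/margin": (chosen_rewards - rejected_rewards).mean().item(),
        # Fraction of pairs the implicit reward already orders correctly.
        "rewards/accuracy": (chosen_rewards > rejected_rewards).float().mean().item(),
    }
    return loss, metrics
```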
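
For the evaluation point, preference accuracy on held-out pairs (how often the implicit reward ranks chosen above rejected) is a more interpretable signal than loss alone. A sketch under the same assumptions as above; the batch iterable and its field names are hypothetical.

```python
import torch

@torch.no_grad()
def heldout_preference_accuracy(eval_batches, beta: float = 0.1) -> float:
    """eval_batches yields dicts of pre-computed sequence logps (assumed schema)."""
    correct, total = 0, 0
    for batch in eval_batches:
        chosen = beta * (batch["policy_chosen_logps"] - batch["ref_chosen_logps"])
        rejected = beta * (batch["policy_rejected_logps"] - batch["ref_rejected_logps"])
        correct += (chosen > rejected).sum().item()
        total += chosen.numel()
    return correct / total  # ~0.5 means no better than chance
```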