You now understand preference alignment, the family of techniques that makes capable models pleasant to use.
Takeaways:
- DPO collapsed RLHF's reward model and RL loop into a single supervised objective over preference pairs (first sketch below)
- KTO learns from unpaired good/bad ratings instead of preference pairs (second sketch below)
- ORPO and SimPO drop the frozen reference model entirely (third sketch below)
- Synthetic preference data scales by using AI judges in place of human annotators
- Alignment refines capable models. It doesn't fix broken ones.
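To ground the first takeaway, here is a minimal PyTorch sketch of the DPO objective. The function and tensor names are mine, not from any library; each tensor holds a batch of per-sequence log-probabilities (summed over tokens) from the policy and the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # DPO's implicit reward: log-ratio of policy to frozen reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```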
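KTO's key change is that each example arrives alone with a binary rating, not as half of a pair. This is a simplified sketch of the idea, assuming equal weights for desirable and undesirable examples and a precomputed KL reference point `kl_ref`; the full objective estimates that reference point per batch and weights the two classes separately.

```python
import torch

def kto_loss(policy_logps: torch.Tensor,
             ref_logps: torch.Tensor,
             is_desirable: torch.Tensor,  # bool per example: True = rated good
             beta: float = 0.1,
             kl_ref: float = 0.0) -> torch.Tensor:
    rewards = beta * (policy_logps - ref_logps)
    # Desirable outputs should score above the reference point, undesirable below
    value = torch.where(is_desirable,
                        torch.sigmoid(rewards - kl_ref),
                        torch.sigmoid(kl_ref - rewards))
    return (1.0 - value).mean()
```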
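SimPO shows what reference-free looks like: the implicit reward is just the policy's length-normalized log-likelihood, so the frozen reference model (and its memory cost) disappears. A sketch under that formulation; the `beta` and `gamma` defaults here are illustrative, not tuned.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,
               rejected_logps: torch.Tensor,
               chosen_lengths: torch.Tensor,
               rejected_lengths: torch.Tensor,
               beta: float = 2.0,
               gamma: float = 0.5) -> torch.Tensor:
    # Reward = length-normalized log-likelihood; no reference model needed
    chosen_rewards = beta * chosen_logps / chosen_lengths
    rejected_rewards = beta * rejected_logps / rejected_lengths
    # Require the chosen response to win by at least a margin gamma
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```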
Next, I'll show you the tools that make fine-tuning practical.