Decision guide:
- Have paired preferences: Start with DPO
- Have only unpaired per-response feedback (e.g., thumbs up/down), not pairs: Use KTO
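- Want a simpler pipeline: Try ORPO (folds preference optimization into the SFT stage) or SimPO
- Memory-constrained: SimPO (no reference model to hold in memory; ORPO is also reference-free)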
- Can sample multiple responses per prompt and score them: Consider GRPO
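To make the "no reference model" point in the list concrete, here is a minimal sketch of the per-pair DPO and SimPO losses in plain Python. The function names and the default `beta` and `gamma` values are illustrative assumptions, not recommended settings, and the inputs are summed sequence log-probabilities you would normally get from a model. Note that `dpo_loss` needs log-probabilities from a frozen reference model, while `simpo_loss` only needs the policy's own log-probabilities and the response lengths.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    All four inputs are summed log-probabilities of a full response.
    The ref_* values come from a frozen reference model, which is the
    extra model DPO has to keep around (or precompute) during training.
    beta=0.1 is an illustrative default, not a tuned value.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def simpo_loss(policy_chosen: float, policy_rejected: float,
               len_chosen: int, len_rejected: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO loss for one preference pair.

    Uses length-normalized policy log-probabilities and a target margin
    gamma; no reference model appears anywhere. beta and gamma here are
    illustrative assumptions.
    """
    margin = beta * (policy_chosen / len_chosen
                     - policy_rejected / len_rejected) - gamma
    return -log_sigmoid(margin)
```

When the policy still equals the reference model, the DPO margin is zero and the loss is `log(2)`; SimPO has no such anchor, so its loss depends only on how the policy itself scores the two responses.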
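DPO is the safe default; the others offer advantages in specific situations. All are simpler to run than classic PPO-based RLHF: the offline methods (DPO, KTO, ORPO, SimPO) skip the separate reward model and RL loop, while GRPO is still online RL but drops PPO's value (critic) model.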
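The guide above can be sketched as a small helper that maps a training setup to candidate methods. The `Setup` fields and the `suggest_methods` function are hypothetical names introduced for illustration. The ordering of suggestions simply follows the order of the list, and falling back to DPO when nothing else matches is an assumption based on the "safe default" remark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Setup:
    """Hypothetical description of what data and resources are available."""
    has_paired_preferences: bool = False
    has_unpaired_feedback: bool = False   # ratings on single responses only
    wants_simpler_pipeline: bool = False
    memory_constrained: bool = False
    can_generate_multiple_responses: bool = False


def suggest_methods(setup: Setup) -> List[str]:
    """Return candidate methods in the order the decision guide lists them."""
    suggestions: List[str] = []
    if setup.has_paired_preferences:
        suggestions.append("DPO")
    if setup.has_unpaired_feedback:
        suggestions.append("KTO")
    if setup.wants_simpler_pipeline:
        suggestions.extend(["ORPO", "SimPO"])
    if setup.memory_constrained:
        suggestions.append("SimPO")
    if setup.can_generate_multiple_responses:
        suggestions.append("GRPO")
    if not suggestions:
        # Assumption: treat DPO as the fallback when no rule applies.
        suggestions.append("DPO")
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(suggestions))
```

For example, `suggest_methods(Setup(has_paired_preferences=True, memory_constrained=True))` would return both `"DPO"` and `"SimPO"`, leaving the final choice to the reader; the helper only narrows the options and does not rank them.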