DPO (Direct Preference Optimization) dramatically simplifies RLHF. Instead of training a separate reward model and then optimizing against it with reinforcement learning (typically PPO), DPO trains the policy directly on preference pairs.
The insight: the KL-constrained RLHF objective has a closed-form optimal policy, so the reward can be rewritten in terms of the policy itself. That lets DPO replace explicit reward modeling with a simple classification-style loss on preference pairs: the policy's log-probability ratios against a frozen reference model act as implicit rewards, and the loss pushes the margin between the chosen and rejected response upward.
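A minimal sketch of that loss, assuming the per-sequence log-probabilities have already been computed by the policy and the frozen reference model (the function name and argument names here are illustrative, not from any particular library):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from summed sequence log-probs.

    beta scales the implicit reward; the loss is -log sigmoid of the
    margin between the chosen and rejected implicit rewards.
    """
    # Implicit rewards: beta * log-ratio of policy vs. reference
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Binary cross-entropy on the margin: -log sigmoid(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree, the margin is 0 and the loss is log 2.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# As the policy raises the chosen response relative to the reference
# (and lowers the rejected one), the loss falls below that baseline.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

In a real training loop these scalars would be batched tensors (e.g. in PyTorch), and the gradient flows only through the policy log-probs; the reference model stays frozen.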
In the original paper's evaluations it matches or exceeds PPO-based RLHF, with a much simpler implementation: no reward model, no RL loop, and no sampling during training.