RLHF has practical issues:
- Training instability: PPO is sensitive to hyperparameters
- Reward hacking: Models exploit reward model weaknesses
- Complexity: Requires running multiple models (policy, reference, reward, and value) simultaneously
- Memory: Keeping the reward and reference models resident adds significant GPU memory overhead
These problems drove the search for simpler alternatives. Direct Preference Optimization (DPO) emerged as one of the most widely adopted, replacing the learned reward model and RL loop with a single supervised objective on preference pairs.
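To make the contrast concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have summed token log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model; the function name and argument names are illustrative, not from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model; beta controls
    how far the policy may drift from the reference.
    """
    # Implicit rewards: log-ratios of policy vs. reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Bradley-Terry logit: margin between the two implicit rewards
    logits = beta * (chosen_logratio - rejected_logratio)
    # Negative log-sigmoid: shrinks as the policy favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

Note that no reward model is trained or queried at loss time, and the reference log-probabilities can be precomputed offline, which is exactly where the memory and complexity savings over PPO-based RLHF come from.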