Though DPO avoids training an explicit reward model, reward models remain useful for evaluation. A reward model assigns a scalar score to each (prompt, response) pair. You can use it to:
- Filter training data (keep high-reward examples)
- Evaluate model outputs (compare before/after alignment)
- Guide generation (best-of-n sampling)
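The last item, best-of-n sampling, is simple to sketch: generate n candidate responses and keep the one the reward model scores highest. Here is a minimal illustration with toy stand-ins for the generator and the reward model (`generate` and `score` are hypothetical placeholders, not a specific library's API):

```python
def best_of_n(prompt, generate, score, n=4):
    """Sample n candidate responses and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: score(prompt, r))

# Toy stand-ins: a canned "generator" and a reward model that
# (for illustration only) scores longer responses higher.
_responses = iter(["ok", "a longer, more helpful answer", "meh", "fine"])
generate = lambda prompt: next(_responses)
score = lambda prompt, response: len(response)

best = best_of_n("Explain DPO.", generate, score, n=4)
```

In practice `generate` would sample from the aligned model with temperature > 0 (identical samples defeat the purpose), and `score` would be a trained reward model.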
Reward models are trained on the same preference data: given a chosen and a rejected response to the same prompt, the model learns to score the chosen one higher, i.e. to predict which response humans would prefer.
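The usual training objective is a pairwise logistic (Bradley-Terry) loss on the score difference, which is minimized by pushing the chosen response's score above the rejected one's. A minimal sketch of the per-pair loss:

```python
import math

def pairwise_loss(score_chosen, score_rejected):
    """Bradley-Terry loss: -log sigmoid(score_chosen - score_rejected).

    Small when the chosen response scores well above the rejected one,
    log(2) when the scores are tied, large when the order is inverted.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

tied = pairwise_loss(0.0, 0.0)       # scores equal -> log(2) ~ 0.693
correct = pairwise_loss(2.0, 0.0)    # chosen ahead -> small loss
inverted = pairwise_loss(0.0, 2.0)   # rejected ahead -> large loss
```

In a real trainer, the scores come from a language model with a scalar head, and this loss is averaged over a batch of preference pairs and backpropagated.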