DPO trains on preference pairs. For each pair, it:
1. Computes the log probability of the chosen response under the model being trained
2. Computes the log probability of the rejected response under the model being trained
3. Computes the same log probabilities under the reference model (the frozen SFT model)
4. Applies a loss that raises the chosen response's probability and lowers the rejected one's, relative to the reference
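Concretely, for a prompt $x$ with chosen response $y_w$ and rejected response $y_l$, the per-pair loss from the DPO paper (Rafailov et al., 2023) is:

$$
\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)
$$

where $\sigma$ is the logistic sigmoid, $\pi_\theta$ is the model being trained, $\pi_{\text{ref}}$ is the frozen reference, and $\beta$ is a temperature hyperparameter that controls how far the trained model may drift from the reference.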
No reward model. No RL. Just supervised learning on preferences.
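Here is a minimal sketch of that loss in PyTorch. It assumes the four sequence-level log probabilities have already been computed elsewhere (summed over response tokens); the function and argument names are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), per example
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), reference is frozen
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # temperature: how far the policy may drift
) -> torch.Tensor:
    # Implicit "rewards": how much the trained model has shifted from the reference
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Logistic loss on the margin: pushes the chosen response up
    # and the rejected response down, relative to the reference
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```

In practice each log probability is the sum of token log-probs over the response, and the reference model's values are computed under `torch.no_grad()` since it never updates. The whole thing reduces to a binary classification loss over preference pairs, which is exactly the "just supervised learning" point above.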