The DPO loss:
L = -log σ( β [ log(π(y_w|x)/π_ref(y_w|x)) - log(π(y_l|x)/π_ref(y_l|x)) ] )
Here y_w is the chosen (winning) response, y_l is the rejected (losing) response, π is the model being trained, π_ref is the frozen reference model, and β controls how strongly preferences are enforced relative to the reference.
Don't memorize this. Understand what it does: it raises the log-probability of the chosen response and lowers that of the rejected one, measured relative to where the reference model started.
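To make the formula concrete, here is a minimal per-example sketch in plain Python. The inputs (summed token log-probabilities for each response under the policy and the reference) are hypothetical; a real implementation would batch this with a framework like PyTorch.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from summed log-probs (illustrative inputs)."""
    # Log-ratios of policy to reference, for chosen and rejected responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == softplus(-margin), written in a
    # numerically stable form: max(-margin, 0) + log1p(exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Before any training, policy == reference, so the margin is 0
# and the loss starts at log(2) ≈ 0.6931.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

Note that only the log-ratios matter: if the policy's edge over the reference on the chosen response grows faster than on the rejected one, the margin increases and the loss falls toward zero.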