DPO needs a reference model, typically a frozen copy of the SFT checkpoint you start alignment from. This reference anchors training.
The loss operates on log-probability ratios: each preference pair is scored by how far the policy's log-probs have drifted from the reference's, and the beta coefficient controls how strongly that drift is penalized. Without this anchor, the model can collapse into degenerate solutions, such as always producing the shortest response.
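To make the ratio concrete, here is a minimal sketch of the standard DPO objective in PyTorch. The `*_logps` arguments are assumed to be per-sequence sums of token log-probabilities under the policy and the reference respectively; the function and argument names are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_policy(y_chosen | x), per example
    policy_rejected_logps: torch.Tensor,  # log p_policy(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_chosen | x)
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_rejected | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # How far the policy has moved from the reference on each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between chosen and rejected through a logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

A larger beta penalizes drift from the reference more aggressively; 0.1 is a common starting point.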
Keep the reference frozen throughout training: it exists only to supply baseline log-probabilities, so it should never receive gradient updates, and its forward passes can run under no_grad.
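One way to set that up, sketched below with a hypothetical `make_reference` helper: deep-copy the SFT weights, switch to eval mode, and disable gradients so the anchor never moves.

```python
import copy
import torch

def make_reference(sft_model: torch.nn.Module) -> torch.nn.Module:
    # Deep-copy the SFT checkpoint so alignment updates never touch it.
    ref = copy.deepcopy(sft_model)
    ref.eval()  # deterministic baseline: disables dropout, etc.
    for p in ref.parameters():
        p.requires_grad_(False)  # the reference receives no gradients
    return ref
```

Computing the reference log-probs inside a `torch.no_grad()` block also avoids storing activations for the reference forward pass, which roughly halves the activation memory of each training step.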