SimPO removes the reference model from DPO entirely. Instead of a reference policy, its implicit reward is the length-normalized average log probability of a response, with a target reward margin between the chosen and rejected responses; the length normalization discourages the policy from exploiting longer outputs for higher reward.
This makes training simpler and cheaper: there is no reference model to keep in memory and no reference log probabilities to compute.
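The idea above can be sketched as a per-pair loss. This is a minimal illustration, not the authors' implementation: the function name, the use of plain Python lists of per-token log probabilities, and the `beta`/`gamma` values are all assumptions for the example.

```python
import math

def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=0.5):
    """Sketch of a SimPO-style loss for one preference pair.

    chosen_logps / rejected_logps: per-token log probabilities of the
    chosen and rejected responses under the policy being trained.
    beta and gamma here are illustrative, not tuned values.
    """
    # Length-normalized implicit rewards: average log-prob scaled by beta.
    r_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    r_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    # Bradley-Terry-style objective with a target reward margin gamma:
    # -log sigmoid(margin), written as log1p(exp(-margin)) for stability.
    margin = r_chosen - r_rejected - gamma
    return math.log1p(math.exp(-margin))
```

Note that no reference-model log probabilities appear anywhere: the only inputs are the policy's own per-token log probabilities for the two responses.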
SimPO matches or exceeds DPO on common preference benchmarks, making it a good choice when GPU memory is constrained.