Alignment can be iterative:
1. Train with DPO on an initial preference dataset.
2. Generate new responses from the aligned model.
3. Collect new preferences on these responses.
4. Train again on the combined data.
Each iteration raises the model's baseline, so later preference rounds probe harder cases where the improved model still falls short. This compounding loop is one way frontier models keep improving after the initial alignment pass.
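The loop above can be sketched schematically. In this toy sketch every component is a simulated stand-in (a hypothetical scalar "skill" plays the model, a comparison of quality scores plays the annotator, and `dpo_update` stands in for a real DPO training step); it only illustrates how the dataset accumulates across iterations while each round's generations come from the latest model.

```python
import random

def generate_responses(skill, n=8):
    # Toy stand-in for sampling from the current model:
    # higher skill -> responses cluster around higher quality scores.
    return [random.gauss(skill, 1.0) for _ in range(n)]

def collect_preferences(responses):
    # Simulated annotator: prefers the higher-quality response in each pair.
    pairs = []
    for a, b in zip(responses[::2], responses[1::2]):
        chosen, rejected = (a, b) if a > b else (b, a)
        pairs.append((chosen, rejected))
    return pairs

def dpo_update(skill, preference_data, lr=0.05):
    # Stand-in for DPO training: nudge the model toward chosen responses.
    # (A real step would minimize the DPO loss over chosen/rejected pairs.)
    for chosen, rejected in preference_data:
        skill += lr * (chosen - skill)
    return skill

random.seed(0)
skill = 0.0
dataset = []
for iteration in range(3):
    responses = generate_responses(skill)      # sample from current model
    dataset += collect_preferences(responses)  # new preferences on its outputs
    skill = dpo_update(skill, dataset)         # retrain on the combined data
    print(f"iteration {iteration}: skill = {skill:.3f}")
```

The key structural point is that `dataset` is cumulative: each training round sees both the earlier preferences and the new ones gathered on the latest model's own outputs, which is what lets later rounds target the harder cases.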