GRPO (Group Relative Policy Optimization) samples a group of responses per prompt and scores each one with a reward function.
Instead of paired comparisons, each response's reward is normalized against the group's mean and standard deviation to produce a relative advantage. The model learns from how each response compares to its siblings in the group rather than from pairwise preferences, and no separate value (critic) network is needed.
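A minimal sketch of the group-relative advantage computation, assuming scalar rewards and population statistics over the group (the function name and epsilon choice here are illustrative, not from any particular library):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one prompt's group of responses.

    rewards: list of scalar rewards, one per sampled response.
    Each advantage is the response's reward normalized by the group's
    mean and standard deviation, so it measures standing within the group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four responses to the same prompt, scored by a reward function:
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses above the group mean get positive advantages and are reinforced; those below get negative advantages and are suppressed, which is how the relative comparison replaces an absolute value estimate.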
GRPO was introduced by DeepSeek and used to train its models, including DeepSeekMath and DeepSeek-R1. It can be more sample-efficient than DPO when generating multiple responses per prompt is cheap, since every response in a group contributes a training signal.