Best-of-N is the simplest alignment technique: generate responses, score them with a reward model, and return the best one.
No training required. You're just filtering at inference time. Even with a moderate N, the best of the sampled responses often beats greedy decoding.
The trade-off is inference cost: you generate N times more tokens per query. Use Best-of-N as a baseline before investing in fine-tuning. If it solves your problem, you don't need to train anything.
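The whole technique fits in a few lines. Here's a minimal sketch; `toy_generate` and `toy_score` are hypothetical stand-ins for an actual LLM sampler and reward model, which is where a real setup would plug in:

```python
import random

def best_of_n(prompt, generate, score, n=8):
    """Sample n candidate responses and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins (assumptions, not a real model or reward model):
# generate() would call your LLM with sampling enabled (temperature > 0),
# score() would call a trained reward model on (prompt, response).
def toy_generate(prompt):
    return prompt + " " + random.choice(["ok", "good", "great", "excellent"])

def toy_score(response):
    # Placeholder heuristic standing in for a learned reward.
    return {"ok": 0, "good": 1, "great": 2, "excellent": 3}[response.split()[-1]]

best = best_of_n("Answer:", toy_generate, toy_score, n=16)
```

Note that `generate` must use stochastic sampling; with greedy decoding all N candidates would be identical and the filtering step would buy you nothing.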