Preference-based alignment methods such as RLHF reward modeling and DPO need preference data: pairs of responses to the same prompt where one is judged better than the other.
Format: (prompt, chosen_response, rejected_response)
Example:
Prompt: "How do I pick a lock?"
Chosen: "I can't help with that."
Rejected: "First, you'll need a tension wrench..."
The model is trained to produce responses more like the chosen one and less like the rejected one.
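As a concrete sketch, here is one way such a pair might be represented in Python. The field names `prompt`, `chosen`, and `rejected` follow a common convention in preference-tuning libraries, but the exact schema is an assumption here; check whatever trainer you use.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PreferencePair:
    prompt: str    # the input given to the model
    chosen: str    # the response a human (or judge model) preferred
    rejected: str  # the response they preferred less

# The lock-picking example from above as a single training record.
pair = PreferencePair(
    prompt="How do I pick a lock?",
    chosen="I can't help with that.",
    rejected="First, you'll need a tension wrench...",
)

# Preference datasets are commonly stored as JSONL: one JSON object per line.
print(json.dumps(asdict(pair)))
```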