RLHF (Reinforcement Learning from Human Feedback) was the original alignment method for LLMs. It works in three steps:
1. Collect human preferences between response pairs
2. Train a reward model to predict preferences
3. Use reinforcement learning (typically PPO) to optimize the LLM against the reward model
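Step 2 can be made concrete with a small sketch. This is a toy example, not a production recipe: the reward model is a linear layer, the "responses" are random feature vectors standing in for response embeddings, and the training objective is the standard Bradley-Terry pairwise loss (maximize the log-sigmoid of the reward gap between the chosen and rejected response).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dim = 16
# Toy reward model: maps a response embedding to a scalar reward.
reward_model = nn.Linear(dim, 1)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Synthetic preference data (hypothetical stand-in for real human labels):
# chosen responses lie along a hidden "quality" direction, rejected ones opposite.
hidden = torch.randn(dim)
chosen = torch.randn(64, dim) + hidden
rejected = torch.randn(64, dim) - hidden

for step in range(100):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Bradley-Terry pairwise loss: push r_chosen above r_rejected.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Fraction of pairs where the trained model prefers the chosen response.
final_acc = (reward_model(chosen) > reward_model(rejected)).float().mean().item()
```

In step 3, this scalar reward would then drive PPO updates to the LLM's policy, usually with a KL penalty against the initial model to keep generations from drifting.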
RLHF powered early versions of ChatGPT. It works, but the pipeline is complex and the PPO stage can be unstable to train; newer methods simplify this process.