RLHF (Reinforcement Learning from Human Feedback) was the original method for aligning LLMs with human preferences. It works in three steps:
1. Collect human preferences between pairs of model responses
2. Train a reward model to predict those preferences
3. Use RL (typically PPO) to optimize the LLM against the reward model
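To make step 2 concrete, the reward model is usually trained with a pairwise (Bradley-Terry) objective: the preferred response should receive a higher scalar score than the rejected one. Below is a minimal PyTorch sketch; the function name, the `reward_model` call in the usage comment, and the tensor shapes are illustrative assumptions, not the API of any particular library.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(chosen_scores: torch.Tensor,
                             rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected),
    averaged over the batch. Pushes the preferred response's score
    above the rejected response's score."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Hypothetical usage, assuming `reward_model` maps (prompt, response)
# to a scalar score per example:
#   chosen   = reward_model(prompts, preferred_responses)   # shape: (batch,)
#   rejected = reward_model(prompts, rejected_responses)    # shape: (batch,)
#   loss = pairwise_preference_loss(chosen, rejected)
#   loss.backward()
```

The loss only depends on the score *difference*, which is why reward-model scores are meaningful relatively rather than absolutely.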
RLHF powered early versions of ChatGPT. It works, but the three-stage pipeline is complex and the PPO stage can be unstable to train, which is why newer methods aim to simplify the process.
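Part of that complexity comes from step 3: during PPO, the reward actually optimized is usually the reward-model score minus a KL penalty that keeps the policy close to the frozen reference (SFT) model, so the policy cannot drift into degenerate text that merely exploits the reward model. The sketch below shows that penalized reward; the function name, the `kl_coef` value, and the tensor shapes are illustrative assumptions.

```python
import torch

def kl_penalized_reward(reward_scores: torch.Tensor,      # reward-model score per sequence, shape (batch,)
                        policy_logprobs: torch.Tensor,     # log-probs of generated tokens under the policy, (batch, seq_len)
                        ref_logprobs: torch.Tensor,        # log-probs of the same tokens under the frozen SFT model
                        kl_coef: float = 0.1) -> torch.Tensor:
    """Reward signal fed to PPO: reward-model score minus a per-sequence
    KL estimate between the policy and the reference model."""
    # Per-sequence KL estimate: sum over tokens of (log pi - log pi_ref).
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return reward_scores - kl_coef * kl_estimate
```

Tuning `kl_coef` (and the rest of the PPO machinery around it) is one of the moving parts that newer, simpler methods try to remove.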