Measure alignment success through:
- Preference win rate: How often does the aligned model win head-to-head against the baseline?
- Reward model score: Do outputs score higher than before?
- Human evaluation: Do people actually prefer the outputs?
- Safety benchmarks: Does the model refuse harmful requests appropriately?
- Capability retention: Did alignment degrade performance on standard capability benchmarks?
Use multiple metrics. No single number captures alignment quality.
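As a concrete illustration of the first metric, here is a minimal sketch of computing a preference win rate from pairwise judgments. The label names (`"aligned"`, `"baseline"`, `"tie"`) and the half-credit treatment of ties are assumptions for the example, not part of any particular evaluation harness:

```python
from collections import Counter

def win_rate(judgments):
    """Preference win rate for the aligned model over the baseline.

    `judgments` is a list of per-comparison labels, each one of
    "aligned", "baseline", or "tie". Ties count as half a win,
    one common convention (an assumption here, not a standard).
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to score")
    return (counts["aligned"] + 0.5 * counts["tie"]) / total

# Example: 6 wins, 3 losses, 1 tie over 10 head-to-head comparisons
print(win_rate(["aligned"] * 6 + ["baseline"] * 3 + ["tie"]))  # 0.65
```

In practice you would also report a confidence interval over the comparisons, since win rates from small judgment sets are noisy.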