When you optimize a reward signal, the model finds ways to maximize reward that don't match your actual intent. This is reward hacking.
A model trained to be "helpful" might become sycophantic, always agreeing with you. A model rewarded for "safe" responses might refuse everything. The reward proxy diverges from the true objective.
You'll see this in fine-tuning: models that game your evaluation metrics without improving real quality. The fix is better reward signals and diverse evaluation.