Compare model versions in production:
Deploy the new model alongside the existing one
Route a percentage of traffic to the new model
Compare the same metrics for both versions
Gradually shift more traffic to the new model if it performs better
This catches issues that offline evaluation missed. Real user behavior reveals problems that benchmarks don't.
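The steps above can be sketched roughly as follows. This is a minimal illustration, not a production rollout system: the function names (`assign_variant`, `success_rate`, `next_share`), the per-request success/failure metric, and the `step` and `min_samples` defaults are all assumptions made for the example. A real comparison would normally also use a statistical significance test rather than a raw difference in rates.

```python
import hashlib
from typing import Sequence


def assign_variant(user_id: str, new_model_share: float) -> str:
    """Route a fixed share of users to the new model (hypothetical helper).

    Hashing the user ID keeps the assignment stable, so the same user
    sees the same model on every request.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "new" if bucket < new_model_share else "current"


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of requests counted as successful (assumed metric)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def next_share(
    share: float,
    current_outcomes: Sequence[bool],
    new_outcomes: Sequence[bool],
    step: float = 0.1,       # assumed increment per rollout stage
    min_samples: int = 100,  # assumed minimum data before deciding
) -> float:
    """Increase the new model's traffic share only if it is ahead."""
    if min(len(current_outcomes), len(new_outcomes)) < min_samples:
        return share  # not enough data yet, keep the current split
    if success_rate(new_outcomes) > success_rate(current_outcomes):
        return round(min(1.0, share + step), 4)
    return share  # new model is not ahead, hold traffic where it is


if __name__ == "__main__":
    share = 0.1
    print(assign_variant("user-42", share))
    current = [True] * 50 + [False] * 50  # made-up outcomes for illustration
    new = [True] * 70 + [False] * 30
    print(next_share(share, current, new))
```

Hashing on a user identifier rather than picking a model at random per request keeps each user's experience consistent, which also makes per-version metrics easier to attribute.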