Use strong LLMs to evaluate your model outputs:
1. Generate responses from your model
2. Ask a strong judge model (e.g. GPT-4 or Claude) to rate quality
3. Aggregate the ratings into scores
This scales far better than human evaluation, and it correlates well with human preferences when done carefully: give the judge a clear rubric, and aggregate over multiple judge calls to reduce rating noise.
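A minimal sketch of the judge-and-aggregate loop. The `call_model` argument and the rubric wording are assumptions, not a specific API: `call_model` stands in for any function that sends a prompt string to a judge model and returns its reply text (e.g. a thin wrapper around an OpenAI or Anthropic client).

```python
import statistics

# Hypothetical rubric; a real one should pin down each score level.
RUBRIC = (
    "Rate the response from 1 to 5 for accuracy and helpfulness. "
    "Reply with a single integer."
)

def judge_once(prompt: str, response: str, call_model) -> int:
    """One judge call: send rubric + prompt + response, parse an integer rating.
    call_model is any callable mapping a prompt string to reply text."""
    reply = call_model(f"{RUBRIC}\n\nPrompt: {prompt}\nResponse: {response}")
    return int(reply.strip())

def judge(prompt: str, response: str, call_model, n_calls: int = 3) -> float:
    """Aggregate several judge calls into one score; the median is
    robust to a single outlier rating."""
    ratings = [judge_once(prompt, response, call_model) for _ in range(n_calls)]
    return statistics.median(ratings)

# Usage with a stub judge standing in for a real API call:
fake_judge = lambda _prompt: "4"
print(judge("What is 2+2?", "4", fake_judge))  # -> 4
```

In practice you would also handle unparseable replies (the judge may return prose instead of a bare integer) and randomize response order when comparing models, since judges show position bias.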