Red teaming means actively trying to break your model before users do. You search for prompts that produce harmful, incorrect, or embarrassing outputs.
Approaches:
- Manual adversarial prompting by your team
- Automated jailbreak testing with known attack patterns
- Hiring external red teamers for fresh perspectives
Document what you find and add examples to your training data. Red teaming is especially important before public deployment. What you don't catch, your users will.