Safety alignment prevents harmful outputs: misinformation, dangerous instructions, privacy violations, and offensive content.
Techniques include:
- Training on refusal examples for harmful requests
- RLHF with safety-focused reward models
- Constitutional AI with safety principles
- Output filtering as a safety layer
Safety is an ongoing process. You'll need to red-team your model, monitor production outputs, and iterate. No alignment technique makes a model perfectly safe.