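Most production outages are triggered by changes: deployments, configuration updates, and infrastructure modifications. Change management reduces this risk by controlling how and when changes reach production.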
Best practices:
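- Progressive rollouts: canary → a growing percentage of traffic → full deployment
- Change windows: deploy during low-traffic periods so that failures affect fewer users
- Rollback plans: every change needs a reversal strategy prepared before it ships
- Change freezes: avoid non-essential changes before high-traffic events such as Black Friday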
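The practices above can be sketched in a few lines of Python. This is a minimal illustration, not a real deployment tool: `set_traffic_percent`, `is_healthy`, the stage percentages, and the `FreezeWindow` type are all hypothetical names chosen for the example, and a real system would also wait and collect metrics between stages.

```python
from datetime import date
from typing import Callable, Sequence, Tuple

# Hypothetical representation of a change freeze: inclusive start and end dates.
FreezeWindow = Tuple[date, date]


def in_change_freeze(today: date, freezes: Sequence[FreezeWindow]) -> bool:
    """Return True if `today` falls inside any freeze window."""
    return any(start <= today <= end for start, end in freezes)


def progressive_rollout(
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
    stages: Sequence[int] = (1, 10, 50, 100),  # assumed stages: canary first, then wider
) -> bool:
    """Shift traffic to the new version stage by stage.

    Rolls back (0% on the new version) at the first failed health check
    and returns False; returns True if every stage passes.
    """
    for percent in stages:
        set_traffic_percent(percent)
        if not is_healthy():
            set_traffic_percent(0)  # rollback: all traffic returns to the old version
            return False
    return True


# Example with stand-in callables: record each traffic setting, always report healthy.
history = []
succeeded = progressive_rollout(history.append, lambda: True)
```

The rollback path is part of the rollout function itself, so no stage can run without a reversal step available. A caller would typically check `in_change_freeze` first and refuse to start the rollout during a freeze.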
Google's approach:
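- Changes to live systems are the primary cause of outages (the Google SRE book puts the figure at roughly 70%)
- All changes should be reversible, so a bad rollout can be undone quickly and safely
- Automation reduces human error in rollouts, problem detection, and rollbacks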
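Interview tip: If asked "Would you deploy a fix before Black Friday?", answer that it depends on the severity of the bug versus the risk of the deployment. If the fix is worth shipping, use a careful canary with a rollback ready.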
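One way to make the severity-versus-risk trade-off concrete is a simple comparison. The sketch below is purely illustrative: the three-level scale, the `Level` and `pre_event_decision` names, and the tie-breaking rule are assumptions for this example, not an established policy.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-point scale used for both severity and risk."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def pre_event_decision(bug_severity: Level, deploy_risk: Level) -> str:
    """Compare the cost of leaving the bug in place with the risk of deploying."""
    if bug_severity > deploy_risk:
        # The bug hurts more than the deployment is likely to.
        return "deploy with a canary and rollback ready"
    if bug_severity < deploy_risk:
        # The deployment is the bigger danger; wait for the freeze to end.
        return "defer until after the event"
    # Equal scores: no mechanical answer, so a human decides.
    return "escalate for a judgment call"
```

For example, a high-severity bug with a low-risk fix yields "deploy with a canary and rollback ready", while a low-severity bug with a high-risk fix yields "defer until after the event". In an interview, the point is to state both inputs explicitly before giving an answer.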