Bad alerts make on-call life miserable. Good alerts are actionable: each one tells the responder that something needs doing now.
Alert principles:
- Page on symptoms, not causes: Alert on "users can't log in," not "CPU high"
- Every alert needs a runbook: What to do when it fires
- Avoid flapping: Set thresholds clear of normal variation and require the condition to hold for a minimum duration before firing
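One way to picture the duration rule is a check that fires only after several consecutive breaching samples. This is a minimal sketch, not any monitoring tool's API: the class name `SustainedThreshold`, the 5% threshold, and the three-sample requirement are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SustainedThreshold:
    """Fires only after `required_samples` consecutive breaches (hypothetical helper)."""

    threshold: float
    required_samples: int
    breaches: int = 0

    def observe(self, value: float) -> bool:
        # Count consecutive samples above the threshold; any healthy
        # sample resets the streak, so a brief spike cannot fire the alert.
        if value > self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0
        return self.breaches >= self.required_samples


# Assumed example: login error ratio sampled once a minute.
alert = SustainedThreshold(threshold=0.05, required_samples=3)
samples = [0.08, 0.02, 0.09, 0.07, 0.06, 0.01]
states = [alert.observe(v) for v in samples]
print(states)
```

The single spike at the start never fires; only the run of three breaching samples does, and the alert clears as soon as a healthy sample arrives.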
Alert on SLOs:
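- Burn rate alerts: Alert on how fast the error budget is being consumed: "At this rate, we'll exhaust the error budget in X hours"
- Multi-window: Pair a fast-burn alert (short window, catches sudden outages) with a slow-burn alert (long window, catches gradual degradation)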
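The arithmetic behind these bullets can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The 99.9% target, 30-day period, window lengths, and thresholds below are illustrative assumptions, not values taken from these notes.

```python
from dataclasses import dataclass


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def hours_to_exhaustion(rate: float, slo_period_hours: float) -> float:
    """Hours until a full budget is gone if the current burn rate continues."""
    if rate <= 0:
        return float("inf")
    return slo_period_hours / rate


@dataclass
class BurnRule:
    name: str
    window_hours: float
    threshold: float  # burn rate at or above which the rule fires


# Assumed rules: a fast-burn rule on a short window and a slow-burn
# rule on a long window. Tune both to your own SLO.
RULES = [
    BurnRule("fast burn", window_hours=1, threshold=14.4),
    BurnRule("slow burn", window_hours=72, threshold=1.0),
]


def firing_rules(error_ratio_by_window: dict, slo_target: float) -> list:
    """Return the names of rules whose window's burn rate meets the threshold."""
    return [
        rule.name
        for rule in RULES
        if burn_rate(error_ratio_by_window[rule.window_hours], slo_target)
        >= rule.threshold
    ]


SLO_TARGET = 0.999       # assumed 99.9% availability target
PERIOD_HOURS = 30 * 24   # assumed 30-day SLO period

# Sudden outage: 2% errors in the last hour, little impact yet over 72 hours.
outage = firing_rules({1: 0.02, 72: 0.0005}, SLO_TARGET)
# Gradual degradation: 0.3% errors sustained over 72 hours.
degradation = firing_rules({1: 0.0005, 72: 0.003}, SLO_TARGET)
print(outage, degradation)
print(hours_to_exhaustion(burn_rate(0.02, SLO_TARGET), PERIOD_HOURS))
```

With these assumed numbers, a 2% error ratio is a burn rate of about 20, which would exhaust a 30-day budget in roughly 36 hours. That figure is the "X hours" in the burn-rate message.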
Interview question: "How do you reduce alert fatigue?"
Review alerts monthly. Delete ones that never fire. Combine related alerts. Ensure every alert has clear ownership and action.
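A monthly review like this can be partly automated. The sketch below assumes you can export each alert's name, recent fire count, owner, and runbook link; the `AlertRecord` shape, the lookback field, and the sample data are hypothetical, not any alerting system's real schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRecord:
    """Hypothetical export row for one alert rule."""

    name: str
    fires_in_lookback: int       # times fired in the review window
    owner: Optional[str]         # team or person responsible
    runbook_url: Optional[str]   # what to do when it fires


def review(alert: AlertRecord) -> list:
    """Return the review findings for one alert; empty means it passes."""
    findings = []
    if alert.fires_in_lookback == 0:
        findings.append("never fired: consider deleting")
    if not alert.owner:
        findings.append("no owner")
    if not alert.runbook_url:
        findings.append("no runbook")
    return findings


# Made-up sample data for illustration only.
alerts = [
    AlertRecord("login-error-budget-burn", 4, "identity-team", "https://example.com/runbooks/login"),
    AlertRecord("disk-temp-high", 0, None, None),
]

report = {a.name: review(a) for a in alerts}
for name, findings in report.items():
    print(name, findings or "ok")
```

Combining related alerts still needs human judgment, so the script only flags candidates for the review meeting.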