Chaos engineering tests system resilience by intentionally introducing failures in production.
Scientific method:
Define steady state (normal metrics)
Hypothesize: System will maintain steady state during X failure
Introduce failure
Observe: Did steady state hold?
Principles:
- Start small, increase blast radius gradually
- Run in production (staging is too different)
- Automate experiments for regular testing
- Minimize blast radius with kill switches
Netflix runs chaos experiments continuously. You find problems before customers do.