Chaos engineering tests resilience by intentionally causing failures:
Netflix Chaos Monkey: Randomly terminates production instances, so engineers must build services that survive the loss of any single instance.
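The random-termination idea can be sketched with a simulated fleet. This is not Chaos Monkey's actual implementation: the instance names, the `probability` parameter, and both function names are hypothetical, and nothing here calls a real cloud API.

```python
import random

def pick_victims(instances, probability, rng):
    """Select each instance for termination independently with the given probability."""
    return [name for name in instances if rng.random() < probability]

def run_round(instances, probability, rng):
    """Simulate one round: return (survivors, killed) without touching real infrastructure."""
    victims = set(pick_victims(instances, probability, rng))
    survivors = [name for name in instances if name not in victims]
    return survivors, sorted(victims)

rng = random.Random(7)  # seeded so a run can be repeated
fleet = ["web-1", "web-2", "web-3", "web-4"]  # hypothetical instance names
survivors, killed = run_round(fleet, probability=0.25, rng=rng)
print("killed:", killed, "survivors:", survivors)
```

In a real setup, the list of killed instances would be handed to whatever API terminates instances, and the service would be expected to keep serving traffic from the survivors.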
What to test:
- Server failures
- Network partitions
- High latency
- Resource exhaustion
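Two of these fault types, failures and added latency, can be injected at the function level. The following is a minimal sketch, assuming the call under test is an ordinary Python function; `with_faults`, `InjectedFault`, and `get_profile` are made-up names for illustration, not part of any library.

```python
import random
import time
from functools import wraps

class InjectedFault(Exception):
    """Raised in place of a real server failure."""

def with_faults(failure_rate=0.0, extra_latency_s=0.0, rng=None, sleep=time.sleep):
    """Decorator that adds latency and random failures to a function."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if extra_latency_s > 0:
                sleep(extra_latency_s)  # simulate a slow dependency
            if rng.random() < failure_rate:
                # simulate the dependency being down
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@with_faults(failure_rate=0.0, extra_latency_s=0.05)
def get_profile(user_id):
    # Hypothetical stand-in for a call to a downstream service.
    return {"id": user_id, "name": "example"}

print(get_profile(42))
```

Passing `sleep` and `rng` as parameters keeps the wrapper testable without real delays. Network partitions and resource exhaustion usually need tooling at the network or host level and are not covered by this sketch.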
Process:
- Define steady state (normal metrics)
- Hypothesize what happens during failure
- Inject failure in production
- Compare actual vs expected behavior
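The four steps can be walked through against a toy model. This sketch assumes a simulated service rather than a real production system; the `Service` class, the success-rate metric, and the `min_success_rate` threshold are all illustrative assumptions.

```python
import random

class Service:
    """Toy service: a request succeeds if any replica it tries is healthy."""

    def __init__(self, replicas, retries, rng):
        self.healthy = [True] * replicas
        self.retries = retries
        self.rng = rng

    def handle_request(self):
        # Try a random replica, retrying up to `retries` more times.
        for _ in range(self.retries + 1):
            replica = self.rng.randrange(len(self.healthy))
            if self.healthy[replica]:
                return True
        return False

def success_rate(service, requests=1000):
    """Fraction of simulated requests that succeed."""
    ok = sum(service.handle_request() for _ in range(requests))
    return ok / requests

def run_experiment(service, min_success_rate):
    # Step 1: measure the steady state.
    steady = success_rate(service)
    # Step 2: hypothesis: success rate stays >= min_success_rate with one replica down.
    # Step 3: inject the failure.
    service.healthy[0] = False
    try:
        observed = success_rate(service)
    finally:
        service.healthy[0] = True  # always restore the replica
    # Step 4: compare actual behavior with the hypothesis.
    return {
        "steady": steady,
        "observed": observed,
        "hypothesis_holds": observed >= min_success_rate,
    }

service = Service(replicas=3, retries=2, rng=random.Random(1))
result = run_experiment(service, min_success_rate=0.9)
print(result)
```

If the hypothesis does not hold, the gap between the steady and observed values is the finding: it points at missing retries, too few replicas, or a timeout that is too tight.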
Start small: Begin in staging. Move to production with blast radius limits (a cap on how many hosts or users one experiment can affect). Always have a kill switch ready.
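A blast radius limit and a kill switch can be enforced in the code that drives the experiment. The sketch below only does bookkeeping: `ChaosRun`, `max_blast_fraction`, and the host names are hypothetical, and the actual fault injection and rollback are left to the caller.

```python
import threading

class ChaosRun:
    """Tracks affected targets, enforcing a blast radius cap and a kill switch."""

    def __init__(self, targets, max_blast_fraction):
        self.targets = list(targets)
        # int() rounds down, so the cap errs on the safe side.
        self.limit = int(len(self.targets) * max_blast_fraction)
        self.kill_switch = threading.Event()
        self.affected = []

    def inject(self, target):
        """Record a fault on target; refuse if aborted or the cap is reached."""
        if self.kill_switch.is_set():
            return False
        if len(self.affected) >= self.limit:
            return False
        self.affected.append(target)
        return True

    def abort(self):
        """Flip the kill switch and return the targets that need restoring."""
        self.kill_switch.set()
        restored, self.affected = self.affected, []
        return restored

run = ChaosRun([f"host-{i}" for i in range(10)], max_blast_fraction=0.2)
accepted = [run.inject(f"host-{i}") for i in range(3)]
print("accepted:", accepted)
print("restored on abort:", run.abort())
print("inject after abort:", run.inject("host-5"))
```

Using `threading.Event` for the switch means a monitoring thread or an operator command can trip it while the experiment loop is running.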