Scenario: Requests intermittently fail in a microservices architecture.
Challenges:
- Failures may not reproduce
- Problem could be in any service
- Network issues are transient
Investigation approach:
- Use distributed tracing to find the failing span
- Compare error rates across all services
- Look for correlations in the failures (time, user, region)
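The correlation step can be sketched as a group-by over request records. This is a minimal illustration, not a real tracing API: the record fields (region, instance, ok) and the sample data are made up, and in practice the records would come from your tracing or logging backend.

```python
from collections import Counter

# Hypothetical request records; the field names are assumptions for illustration.
requests = [
    {"region": "us-east", "instance": "api-1", "ok": True},
    {"region": "us-east", "instance": "api-2", "ok": False},
    {"region": "eu-west", "instance": "api-1", "ok": True},
    {"region": "eu-west", "instance": "api-2", "ok": False},
    {"region": "us-east", "instance": "api-1", "ok": True},
    {"region": "eu-west", "instance": "api-2", "ok": True},
]


def failure_rate_by(records, key):
    """Return {value: failure rate} for one dimension of the request records."""
    totals = Counter()
    failures = Counter()
    for record in records:
        totals[record[key]] += 1
        if not record["ok"]:
            failures[record[key]] += 1
    return {value: failures[value] / totals[value] for value in totals}


if __name__ == "__main__":
    for dimension in ("region", "instance"):
        print(dimension, failure_rate_by(requests, dimension))
```

In this sample both regions fail at the same rate (1 in 3), while api-2 fails in 2 of 3 requests and api-1 never fails. A skew like that on one dimension points toward a single bad instance and away from a regional network problem.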
Common causes:
- Misconfigured timeouts (too long, and slow calls tie up threads and connections upstream, so failures cascade)
- Retry storms amplifying load
- One bad instance in the pool
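To make the retry storm cause concrete, here is a rough sketch of the arithmetic, followed by one common mitigation (exponential backoff with full jitter). The function names and default values are assumptions chosen for illustration, not taken from any particular library.

```python
import random


def worst_case_attempts(attempts_per_layer, layers):
    """Worst-case calls reaching the deepest service for one user request,
    assuming every layer independently makes up to `attempts_per_layer`
    attempts (the first try plus its retries)."""
    return attempts_per_layer ** layers


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full-jitter exponential backoff: a delay in seconds drawn uniformly
    from [0, min(cap, base * 2**attempt)]. `attempt` starts at 0."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


if __name__ == "__main__":
    # Three layers, each making up to 3 attempts.
    print(worst_case_attempts(3, 3))
    # Delays are random, so each run prints different values.
    print([round(backoff_delay(n), 3) for n in range(5)])
```

With three layers that each make up to 3 attempts, a single user request can turn into 27 calls against the deepest service. Jitter spreads the retries out in time so that clients do not all retry at the same instant.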
Interview tip: Ask about retry policies and circuit breakers. Aggressive retries and missing or misconfigured circuit breakers are frequent causes of cascading failures.
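If the interviewer asks what a circuit breaker actually does, a minimal sketch along these lines may help. The class name, parameters, and thresholds are assumptions for illustration. It is not thread-safe and leaves out details a production library would handle, such as limiting concurrent trial calls.

```python
import time


class CircuitBreaker:
    """Rejects calls for `reset_timeout` seconds after
    `failure_threshold` consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of adding load to a struggling service.
                raise RuntimeError("circuit open: call rejected")
            # Timeout elapsed: fall through and allow a trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request, for example breaker.call(fetch_profile, user_id), where fetch_profile is a hypothetical client function. While the circuit is open the caller gets an immediate error, which keeps threads from piling up behind a service that is already failing.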