Monitor these metrics to catch problems before they cause outages:
Query performance:
- Slow query log (queries slower than a configured threshold, in ms)
- Query throughput (QPS)
- Average query latency
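
Throughput is usually derived rather than read directly: most servers expose a cumulative query counter, and QPS is the change in that counter over a sampling interval. The sketch below assumes such a counter is sampled twice; the function name and parameters are illustrative, not part of any specific database API.

```python
def queries_per_second(count_start: int, count_end: int, elapsed_seconds: float) -> float:
    """Estimate QPS from two samples of a cumulative query counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = count_end - count_start
    if delta < 0:
        # Assumption: a smaller second sample means the counter was reset
        # (for example by a server restart), so count from zero.
        delta = count_end
    return delta / elapsed_seconds


# Usage: 600 queries executed over a 60-second window.
print(queries_per_second(1000, 1600, 60))
```

Sampling over a fixed window also smooths out short bursts, which keeps a throughput graph readable.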
Resource utilization:
- Connection count vs limit
- Buffer pool hit ratio
- Disk I/O and space
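
As an illustration of the buffer pool hit ratio, the sketch below assumes a MySQL/InnoDB server (the text does not name a database). InnoDB exposes the status counters `Innodb_buffer_pool_read_requests` (logical reads) and `Innodb_buffer_pool_reads` (reads that had to go to disk). The function name is hypothetical, and the counter values are passed in rather than fetched.

```python
from typing import Optional


def buffer_pool_hit_ratio(read_requests: int, disk_reads: int) -> Optional[float]:
    """Fraction of logical reads served from memory instead of disk.

    read_requests: value of Innodb_buffer_pool_read_requests (assumed source)
    disk_reads:    value of Innodb_buffer_pool_reads (assumed source)
    """
    if read_requests <= 0:
        # No reads recorded yet, so the ratio is undefined.
        return None
    return (read_requests - disk_reads) / read_requests


# Usage: 1000 logical reads, 50 of which missed the buffer pool.
print(buffer_pool_hit_ratio(1000, 50))
```

Because both counters are cumulative since server start, computing the ratio from the difference between two samples gives a better picture of current behavior than the lifetime totals.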
Replication:
- Replication lag
- Replica status
Alerts: Set thresholds for connection count (as a % of the configured maximum), replication lag (in seconds), and disk usage (% full).
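
The three alert conditions can be sketched as a single check over one metrics sample. Everything here is illustrative: the class names, field names, and the threshold values in the usage example are placeholders, not recommendations, and collecting the metrics is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Thresholds:
    max_connection_pct: float      # alert above this % of max connections
    max_replication_lag_s: float   # alert above this lag, in seconds
    max_disk_used_pct: float       # alert above this % of disk used


@dataclass
class Sample:
    connections: int
    max_connections: int
    replication_lag_s: Optional[float]  # None if the replica reports no lag value
    disk_used_bytes: int
    disk_total_bytes: int


def evaluate_alerts(sample: Sample, limits: Thresholds) -> List[str]:
    """Return one message per threshold the sample exceeds."""
    alerts = []
    conn_pct = 100.0 * sample.connections / sample.max_connections
    if conn_pct > limits.max_connection_pct:
        alerts.append(f"connection count at {conn_pct:.0f}% of max")
    if sample.replication_lag_s is None:
        # Assumption: a missing lag value means replication is not running.
        alerts.append("replication lag unavailable; check replica status")
    elif sample.replication_lag_s > limits.max_replication_lag_s:
        alerts.append(f"replication lag {sample.replication_lag_s:.0f}s")
    disk_pct = 100.0 * sample.disk_used_bytes / sample.disk_total_bytes
    if disk_pct > limits.max_disk_used_pct:
        alerts.append(f"disk {disk_pct:.0f}% full")
    return alerts


# Usage with placeholder thresholds (pick values that suit your workload).
limits = Thresholds(max_connection_pct=80, max_replication_lag_s=30, max_disk_used_pct=90)
sample = Sample(connections=90, max_connections=100, replication_lag_s=5,
                disk_used_bytes=50, disk_total_bytes=100)
for message in evaluate_alerts(sample, limits):
    print(message)
```

Treating a missing lag value as its own alert covers the replica status item as well, since a stopped replica typically reports no lag rather than a large one.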