What To Watch
A healthy database is not just up; it is keeping pace with its workload. Effective monitoring tracks signals across a few categories:
- Saturation: CPU, memory, disk I/O, and disk space. A full disk is a classic, avoidable outage.
- Throughput and latency: queries per second and the p95 or p99 query time, not just the average, since tail latency hurts users.
- Connections: active connections versus the configured maximum, plus pool wait time. Connection exhaustion blocks new work.
- Replication lag: how far replicas trail the primary, which drives stale reads and risky failovers.
- Errors: deadlocks, lock waits, and failed transactions.
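To illustrate why the list above favors p95 or p99 over the average, here is a minimal sketch using made-up query times. The `percentile` helper and the sample data are assumptions for illustration, and the nearest-rank method is only one of several common percentile definitions.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical query times in milliseconds: 90 fast queries, 10 slow ones.
latencies_ms = [10] * 90 + [500] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p95={p95_ms} ms, p99={p99_ms} ms")
```

With this data the mean stays at 59 ms while p95 and p99 both sit at 500 ms, so one request in ten is far slower than the average suggests.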
Symptoms Versus Causes
Good dashboards separate symptoms users feel, like slow responses, from causes, like a saturated disk or lock contention. Alert on symptoms so you catch real impact, and keep cause metrics handy to diagnose quickly.
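One way to picture this split is to tag each signal as a symptom or a cause and page only on symptoms, attaching any breached causes as diagnostic context. The sketch below is illustrative only: the `Signal` class, the metric names, and the limits are all hypothetical, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    kind: str      # "symptom" (felt by users) or "cause" (explains why)
    value: float
    limit: float   # hypothetical limit chosen for this example

    def breached(self) -> bool:
        return self.value > self.limit

def evaluate(signals):
    """Page only when a symptom breaches; list breached causes as context."""
    symptoms = [s.name for s in signals if s.kind == "symptom" and s.breached()]
    causes = [s.name for s in signals if s.kind == "cause" and s.breached()]
    if not symptoms:
        # In this sketch, a breached cause with no symptom does not page.
        return None
    return {"page_for": symptoms, "likely_causes": causes}

signals = [
    Signal("p99_latency_ms", "symptom", value=900, limit=500),
    Signal("disk_io_utilization", "cause", value=0.97, limit=0.90),
    Signal("lock_waits_per_min", "cause", value=2, limit=50),
]
result = evaluate(signals)
print(result)
```

Here the page fires for slow responses, and the saturated disk appears next to it as the first thing to investigate.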
Baselines Beat Fixed Thresholds
A query time that is normal at peak may be alarming at 3 am. Comparing against a baseline of normal behavior catches anomalies that a single fixed threshold misses. Hard limits, such as a nearly full disk, still deserve fixed alerts.
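A rough sketch of this idea compares the current value with history for the same hour of day, while keeping a fixed limit for disk space. Everything here is an assumption for illustration: the per-hour history, the three-standard-deviation rule, and the 90 percent disk limit.

```python
from statistics import mean, stdev

def is_anomalous(current, history, sigmas=3.0):
    """True when current exceeds the historical mean by more than
    `sigmas` sample standard deviations."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    return current > mean(history) + sigmas * stdev(history)

# Hypothetical p95 query times (ms) observed at the same hour on past days.
baseline_by_hour = {
    3: [20, 22, 19, 21, 20],         # quiet overnight period
    14: [180, 200, 190, 210, 195],   # afternoon peak
}

DISK_HARD_LIMIT = 0.90  # assumed fixed limit on the fraction of disk used

def check(hour, p95_ms, disk_used_fraction):
    alerts = []
    if is_anomalous(p95_ms, baseline_by_hour.get(hour, [])):
        alerts.append("latency_above_baseline")
    if disk_used_fraction >= DISK_HARD_LIMIT:
        alerts.append("disk_nearly_full")
    return alerts

print(check(3, 150, 0.50))    # 150 ms is far above the 3 am baseline
print(check(14, 150, 0.50))   # the same 150 ms is unremarkable at peak
print(check(14, 150, 0.95))   # the hard disk limit fires regardless of baseline
```

The same 150 ms reading alerts at 3 am but not at 2 pm, which a single fixed latency threshold could not express.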
Key idea
Database health monitoring tracks saturation, latency tails, connections, replication lag, and errors, alerting on user-facing symptoms while comparing against baselines rather than relying on fixed thresholds alone.