Monitoring Database Health

Tracking the right signals, such as replication lag, connections, and saturation, tells you a database is failing before users notice.

What To Watch

A healthy database is not just up; it is keeping pace with its workload. Effective monitoring tracks signals across a few categories:

  • Saturation: CPU, memory, disk I/O, and disk space. A full disk is a classic, avoidable outage.
  • Throughput and latency: queries per second and the p95 or p99 query time, not just the average, since tail latency hurts users.
  • Connections: active versus the maximum, plus pool wait time. Connection exhaustion blocks new work.
  • Replication lag: how far replicas trail the primary, which drives stale reads and risky failovers.
  • Errors: deadlocks, lock waits, and failed transactions.
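
As a rough illustration of why the tail matters more than the average, the sketch below computes the mean, p95, and p99 of a made-up set of query times. The sample data and the `percentile` helper (a simple nearest-rank method) are assumptions for illustration, not part of any particular monitoring tool.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing to keep the arithmetic exact for whole numbers.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical query times in milliseconds:
# 93 fast queries, 5 slow ones, and 2 very slow ones.
latencies_ms = [10] * 93 + [500] * 5 + [5000] * 2

avg = mean(latencies_ms)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={avg} ms, p95={p95} ms, p99={p99} ms")
```

With this sample the mean lands near 134 ms, which looks tolerable, while p95 is 500 ms and p99 is 5000 ms. The queries that hurt users only show up in the tail.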

Symptoms Versus Causes

Good dashboards separate symptoms users feel, like slow responses, from causes, like a saturated disk or lock contention. Alert on symptoms so you catch real impact, and keep cause metrics handy to diagnose quickly.
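
One possible way to encode this split is to tag each signal as a symptom or a cause and page only on symptoms, attaching any breached causes as context. The `Signal` class, the `evaluate` function, the metric names, and the limits below are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    kind: str     # "symptom" (users feel it) or "cause" (explains it)
    value: float
    limit: float

    @property
    def breached(self) -> bool:
        return self.value > self.limit


def evaluate(signals):
    """Page only when a symptom is breached; list breached causes
    alongside it so the responder can diagnose quickly."""
    symptoms = [s.name for s in signals if s.kind == "symptom" and s.breached]
    causes = [s.name for s in signals if s.kind == "cause" and s.breached]
    return {"page": bool(symptoms), "symptoms": symptoms, "likely_causes": causes}


# Hypothetical readings: slow responses together with a saturated disk.
incident = evaluate([
    Signal("p99_query_ms", "symptom", value=1800, limit=500),
    Signal("disk_io_util_pct", "cause", value=97, limit=90),
    Signal("lock_wait_ms", "cause", value=12, limit=200),
])

# A busy disk with no user-visible slowdown does not page anyone.
quiet = evaluate([
    Signal("p99_query_ms", "symptom", value=120, limit=500),
    Signal("disk_io_util_pct", "cause", value=97, limit=90),
])

print(incident)
print(quiet)
```

In this sketch the first reading pages and names the disk as a likely cause, while the second stays quiet but still records the saturated disk for the dashboard.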

Baselines Beat Fixed Thresholds

A query time that is normal at peak may be alarming at 3 a.m. Comparing against a baseline of normal behavior catches anomalies that a single fixed threshold misses. Hard limits, such as a nearly full disk, should still alert on their own.
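
A minimal sketch of this idea, assuming a per-hour history of p95 query times and a simple standard-deviation test, is shown below. The function names, the 3-sigma cutoff, the 90% disk limit, and the sample history are illustrative assumptions; real systems often use more robust baselines.

```python
from statistics import mean, stdev


def is_anomalous(value, baseline_samples, z_threshold=3.0):
    """Return True if value sits more than z_threshold standard
    deviations above the mean of the baseline samples."""
    if len(baseline_samples) < 2:
        return False  # not enough history to judge
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    if sigma == 0:
        return value > mu
    return (value - mu) / sigma > z_threshold


def should_alert(p95_ms, hour, baselines_by_hour, disk_used_fraction,
                 disk_limit=0.90):
    """Combine a hard limit (disk space) with a baseline comparison
    (p95 latency against the same hour of day)."""
    if disk_used_fraction >= disk_limit:
        return True  # hard limits fire regardless of the baseline
    return is_anomalous(p95_ms, baselines_by_hour.get(hour, []))


# Hypothetical history of p95 query times (ms), keyed by hour of day.
baselines_by_hour = {
    3: [20, 22, 19, 21, 20],        # quiet overnight traffic
    14: [180, 200, 190, 210, 195],  # afternoon peak
}

peak_alert = should_alert(200, 14, baselines_by_hour, disk_used_fraction=0.40)
night_alert = should_alert(200, 3, baselines_by_hour, disk_used_fraction=0.40)
disk_alert = should_alert(20, 3, baselines_by_hour, disk_used_fraction=0.95)

print(peak_alert, night_alert, disk_alert)
```

Here the same 200 ms reading is treated as normal at 14:00 and as an anomaly at 03:00, while a disk at 95% alerts even when latency looks fine.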

Key idea

Database health monitoring tracks saturation, latency tails, connections, replication lag, and errors. It alerts on user-facing symptoms and compares against baselines rather than relying on fixed thresholds alone.

Check yourself

1. Why track p95 or p99 query latency rather than only the average?

2. Why is replication lag an important health signal?

3. Why prefer baselines over a single fixed threshold for some metrics?