What to watch
A broker can be up yet unhealthy. Good monitoring tracks a handful of signals that reveal trouble before users feel it.
Core metrics
- Consumer lag: gap between latest offset and committed offset; rising lag means consumers cannot keep up.
- Throughput: messages and bytes per second in and out, to spot load shifts.
- Under-replicated partitions: partitions whose in-sync replica set (ISR) has fewer members than the configured replication factor, signaling replication trouble.
- Request latency: produce and fetch times; spikes hint at disk or network pressure.
- Disk and retention: free space and segment age, since a full disk stalls writes.
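As a rough illustration of the first and third metrics above, the sketch below computes per-partition consumer lag and lists under-replicated partitions from plain dictionaries. The function names and input shapes are hypothetical and not tied to any client library; a real setup would read these values from the broker's metrics or admin API.

```python
from typing import Dict, List


def consumer_lag(latest_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: latest offset minus committed offset, floored at zero."""
    lag = {}
    for partition, latest in latest_offsets.items():
        # Assumption: a partition with no commit yet is treated as committed at 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(latest - committed, 0)
    return lag


def under_replicated(isr_sizes: Dict[int, int],
                     replication_factor: int) -> List[int]:
    """Partitions whose ISR has fewer members than the replication factor."""
    return sorted(p for p, size in isr_sizes.items() if size < replication_factor)


if __name__ == "__main__":
    latest = {0: 1500, 1: 980, 2: 2100}
    committed = {0: 1450, 1: 980, 2: 1200}
    lag = consumer_lag(latest, committed)
    print(lag)                # {0: 50, 1: 0, 2: 900}
    print(sum(lag.values()))  # 950
    print(under_replicated({0: 3, 1: 2, 2: 3}, replication_factor=3))  # [1]
```

A single lag reading says little on its own; the trend across successive readings is what shows whether consumers are keeping up.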
Saturation signals
Watch the depth of the main queue and of the dead-letter queue (DLQ). A growing DLQ means failures are climbing; a deep main queue means consumers are falling behind.
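One way to turn these two signals into checks is sketched below, assuming depth is sampled at a fixed interval. The names depth_trend, saturation_signals, and main_depth_limit are illustrative, and the limit itself would need tuning for each workload.

```python
from typing import List, Sequence


def depth_trend(samples: Sequence[int]) -> float:
    """Average change in depth per sampling interval (positive means growing)."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def saturation_signals(main_depth: Sequence[int],
                       dlq_depth: Sequence[int],
                       main_depth_limit: int) -> List[str]:
    """Return human-readable warnings for the two saturation signals."""
    signals = []
    # A deep main queue: the most recent sample exceeds the (assumed) limit.
    if main_depth and main_depth[-1] > main_depth_limit:
        signals.append("main queue deep: consumers falling behind")
    # A growing DLQ: depth is trending upward across the sampled window.
    if depth_trend(dlq_depth) > 0:
        signals.append("DLQ growing: failures climbing")
    return signals


if __name__ == "__main__":
    print(saturation_signals([200, 900, 5000], [3, 3, 3], main_depth_limit=1000))
    print(saturation_signals([200, 210, 190], [0, 4, 12], main_depth_limit=1000))
```

Comparing only the first and last samples keeps the sketch short; a production check would more likely use a rate computed by the metrics system.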
Alerting wisely
Alert on symptoms users feel, such as sustained lag or under-replicated partitions, not on every transient blip. Page only for conditions that need human action; put the rest on a dashboard.
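The difference between a sustained condition and a transient blip can be made concrete with a small check: page only when the most recent samples have all breached the threshold. This is a minimal sketch; sustained_breach, the threshold, and the window length are hypothetical and would be tuned per system.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float],
                     threshold: float,
                     min_consecutive: int) -> bool:
    """True only if the last `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False  # not enough history to call the condition sustained
    return all(s > threshold for s in samples[-min_consecutive:])


if __name__ == "__main__":
    blip = [10, 5000, 20, 30]          # one spike, then recovery
    sustained = [10, 5000, 6000, 7000]  # lag stays high
    print(sustained_breach(blip, threshold=1000, min_consecutive=3))       # False
    print(sustained_breach(sustained, threshold=1000, min_consecutive=3))  # True
```

Most alerting systems offer an equivalent "for this long" setting on a rule, which is usually preferable to hand-rolled logic.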
Key idea
Monitor consumer lag, under-replicated partitions, latency, and disk; alert on user-facing symptoms so problems surface before they cascade.