System Design

Broker Monitoring Metrics

The key signals that tell you whether a broker fleet is healthy.

What to watch

A broker can be up yet unhealthy. Good monitoring tracks a handful of signals that reveal trouble before users feel it.

Core metrics

  • Consumer lag: gap between a partition's latest offset and the consumer group's committed offset; rising lag means consumers cannot keep up with producers.
  • Throughput: messages and bytes per second in and out, to spot load shifts.
  • Under-replicated partitions: partitions whose in-sync replica set (ISR) is smaller than the target replica count, signaling replication trouble.
  • Request latency: produce and fetch times; spikes hint at disk or network pressure.
  • Disk and retention: free space and segment age, since a full disk stalls writes.
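As a rough illustration of the first and third signals, the sketch below computes consumer lag and counts under-replicated partitions from a snapshot of partition state. The `PartitionState` class, its field names, and the sample numbers are hypothetical; a real system would fill them from whatever admin or metrics interface the broker exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition; field names are illustrative."""
    topic: str
    partition: int
    log_end_offset: int      # latest offset written to the partition
    committed_offset: int    # last offset the consumer group committed
    isr_count: int           # replicas currently in sync
    replication_factor: int  # target number of replicas


def consumer_lag(p: PartitionState) -> int:
    """Lag is the gap between the latest and committed offsets, floored at 0."""
    return max(0, p.log_end_offset - p.committed_offset)


def is_under_replicated(p: PartitionState) -> bool:
    """A partition is under-replicated when its ISR is short of the target."""
    return p.isr_count < p.replication_factor


def summarize(partitions: list[PartitionState]) -> dict[str, int]:
    """Roll partition snapshots up into two fleet-level numbers."""
    return {
        "total_lag": sum(consumer_lag(p) for p in partitions),
        "under_replicated": sum(is_under_replicated(p) for p in partitions),
    }


# Made-up sample data: partition 1 is lagging and has lost one replica.
snapshot = [
    PartitionState("orders", 0, log_end_offset=1500, committed_offset=1450,
                   isr_count=3, replication_factor=3),
    PartitionState("orders", 1, log_end_offset=900, committed_offset=600,
                   isr_count=2, replication_factor=3),
]
summary = summarize(snapshot)
print(summary)  # {'total_lag': 350, 'under_replicated': 1}
```

A single lag reading says little on its own; comparing successive snapshots is what shows whether lag is rising.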

Saturation signals

Watch main queue depth and dead-letter queue (DLQ) depth. A growing DLQ means message failures are climbing; a deep main queue means consumers are falling behind.
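One way to turn these two signals into checks is sketched below, assuming depth is sampled at a fixed interval. The function names, the depth limit, and the sample values are illustrative assumptions, not part of any particular broker's tooling.

```python
def depth_trend(samples: list[int]) -> float:
    """Average change in depth per sampling interval (positive means growing)."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def saturation_signals(main_depths: list[int], dlq_depths: list[int],
                       main_depth_limit: int) -> list[str]:
    """Return human-readable saturation findings for one queue pair."""
    signals = []
    # A deep main queue suggests consumers are falling behind.
    if main_depths and main_depths[-1] > main_depth_limit:
        signals.append("main queue deep: consumers falling behind")
    # A DLQ that keeps growing suggests failures are climbing.
    if depth_trend(dlq_depths) > 0:
        signals.append("DLQ growing: failures climbing")
    return signals


# Made-up samples, oldest first; the limit of 500 is an arbitrary example.
signals = saturation_signals([120, 480, 950], [2, 5, 11], main_depth_limit=500)
print(signals)
```

The trend here compares only the first and last samples, which keeps the sketch short but ignores spikes in between; a production check would likely smooth over a longer window.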

Alerting wisely

Alert on symptoms users feel, like sustained lag or under-replicated partitions, not on every transient blip. Page on conditions that need human action; dashboard the rest.
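The difference between a sustained symptom and a transient blip can be sketched as a check over the most recent samples. The threshold of 1000, the three-sample window, and the choice to page on any under-replicated partition are assumptions made for illustration; suitable values depend on the workload.

```python
def is_sustained(values: list[float], threshold: float,
                 min_consecutive: int) -> bool:
    """True when the most recent `min_consecutive` samples all exceed the threshold."""
    if min_consecutive <= 0 or len(values) < min_consecutive:
        return False
    return all(v > threshold for v in values[-min_consecutive:])


def route_alert(lag_samples: list[float], under_replicated: int,
                lag_threshold: float = 1000, min_consecutive: int = 3) -> str:
    """Decide whether a condition pages a human or only shows on a dashboard."""
    # Assumption: any under-replicated partition needs human action.
    if under_replicated > 0:
        return "page"
    # Lag pages only when it stays high, so a single spike does not wake anyone.
    if is_sustained(lag_samples, lag_threshold, min_consecutive):
        return "page"
    return "dashboard"


print(route_alert([200, 5000, 300, 250], under_replicated=0))  # dashboard
print(route_alert([1200, 1500, 2100], under_replicated=0))     # page
```

In the first call the spike to 5000 is ignored because the samples after it fall back under the threshold.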

Key idea

Monitor consumer lag, under-replicated partitions, latency, and disk; alert on user-facing symptoms so problems surface before they cascade.

Check yourself

1. What does a rising consumer lag metric indicate?

2. Why watch under-replicated partitions?

3. What is the wise approach to alerting?