System Design

Design a Metrics and Monitoring System

Collect, store, and alert on time series metrics from many services.

Requirements

  • Collect metrics from thousands of hosts and services.
  • Store time series efficiently and query them for dashboards.
  • Alert when values cross thresholds.

High level design

Agents emit metrics that are ingested, aggregated, stored in a time series database, and queried for dashboards and alerts.

  • Collection: a push or pull agent reports counters and gauges tagged with labels.
  • Storage: a time series database compresses points and downsamples old data.
  • Alerting: an evaluator runs rules over recent data and fires when conditions hold.
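The flow above can be sketched in a few lines. This is a minimal in-memory illustration, not a real time series database: `MetricStore`, `series_key`, `threshold_breached`, the metric name `cpu_utilization`, and the 0.9 threshold are all hypothetical names and values chosen for the example.

```python
from collections import defaultdict


class MetricStore:
    """Toy in-memory store: one list of (timestamp, value) points per series."""

    def __init__(self):
        self.series = defaultdict(list)

    @staticmethod
    def series_key(name, labels):
        # A series is identified by its metric name plus its sorted label pairs.
        return (name, tuple(sorted(labels.items())))

    def record(self, name, labels, timestamp, value):
        self.series[self.series_key(name, labels)].append((timestamp, value))

    def recent(self, name, labels, since):
        # Return the values of all points at or after `since`.
        key = self.series_key(name, labels)
        return [v for t, v in self.series.get(key, []) if t >= since]


def threshold_breached(values, threshold):
    """Fire only if there is data and every recent point is above the threshold."""
    return bool(values) and all(v > threshold for v in values)


# Collection: an agent reports a gauge tagged with labels.
store = MetricStore()
labels = {"host": "web-1", "service": "api"}
for ts, cpu in [(0, 0.50), (10, 0.92), (20, 0.95), (30, 0.97)]:
    store.record("cpu_utilization", labels, ts, cpu)

# Alerting: the evaluator runs a rule over the recent window only.
firing = threshold_breached(store.recent("cpu_utilization", labels, since=10), 0.9)
```

Requiring every point in the window to breach the threshold, rather than just the latest one, is one simple way to keep a single spike from firing an alert.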

Bottlenecks

  • Cardinality: every distinct label combination is its own time series, so too many combinations explode storage and index size. Cap the number of labels and avoid unbounded values like user ids.
  • Write volume: high frequency points add up to a heavy write load, so batch writes and downsample older data to coarser resolution.
  • Query load: dashboards scan ranges, so precompute rollups for common intervals.
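Two of these bottlenecks reduce to simple arithmetic. The sketch below is illustrative only: `worst_case_series` and `downsample` are hypothetical helpers, and the label counts are made-up numbers. It assumes averaging is an acceptable rollup, which suits gauges; counters would usually need a different aggregation.

```python
from collections import defaultdict
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each point to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Cardinality: one unbounded label multiplies the series count.
bounded = worst_case_series({"host": 1000, "endpoint": 20, "status": 5})
unbounded = worst_case_series(
    {"host": 1000, "endpoint": 20, "status": 5, "user_id": 1_000_000}
)

# Downsampling: five raw points collapse into two 60-second rollup points.
raw = [(0, 1.0), (10, 3.0), (20, 5.0), (60, 10.0), (70, 20.0)]
rollup = downsample(raw, bucket_seconds=60)
```

With these assumed counts the bounded metric tops out at 100,000 series, while adding a user id label raises the bound to 100 billion. The same `downsample` routine could also feed precomputed rollups for dashboards.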

Alerts need deduplication and grouping so that a single outage produces one page instead of a separate page for every affected host.
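A rough sketch of grouping and deduplication, assuming each alert is a dictionary of labels. The function `group_alerts`, the label names `alertname` and `cluster`, and the sample alerts are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "cluster")):
    """Collapse duplicate alerts and emit one notification per group key."""
    groups = defaultdict(set)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        # A set drops repeated firings from the same host.
        groups[key].add(alert["host"])
    return [
        {"group": dict(zip(group_by, key)), "hosts": sorted(hosts)}
        for key, hosts in sorted(groups.items())
    ]


alerts = [
    {"alertname": "HighLatency", "cluster": "eu-1", "host": "web-1"},
    {"alertname": "HighLatency", "cluster": "eu-1", "host": "web-2"},
    {"alertname": "HighLatency", "cluster": "eu-1", "host": "web-1"},  # duplicate
    {"alertname": "DiskFull", "cluster": "eu-1", "host": "db-1"},
]
notifications = group_alerts(alerts)
```

Four incoming alerts become two notifications here: one for the latency problem listing both web hosts, and one for the full disk.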

Key idea

A monitoring system ingests tagged time series into a compressed database, controlling cardinality and downsampling so dashboards and alerts stay fast.

Check yourself

1. Why is high cardinality a problem?

2. How is storage kept manageable for old data?