Requirements
- Collect metrics from thousands of hosts and services.
- Store time series efficiently and query them for dashboards.
- Alert when values cross thresholds.
High level design
Agents emit metrics that are ingested, aggregated, stored in a time series database, and queried for dashboards and alerts.
- Collection: a push or pull agent reports counters and gauges tagged with labels.
- Storage: a time series database compresses points and downsamples old data.
- Alerting: an evaluator runs rules over recent data and fires when conditions hold.
Bottlenecks
- Cardinality: too many label combinations explode storage, so cap labels and avoid unbounded values like user ids.
- Write volume: high frequency points are heavy, so batch writes and downsample older data to coarser resolution.
- Query load: dashboards scan ranges, so precompute rollups for common intervals.
Alerts need deduplication and grouping so an outage does not page on every host at once.
Key idea
A monitoring system ingests tagged time series into a compressed database, controlling cardinality and downsampling so dashboards and alerts stay fast.