System Design

Trace-Based Alerting

Triggering alerts on patterns inside traces, not just on aggregate metric thresholds.

Beyond Metric Thresholds

A classic alert fires when a metric crosses a line, like error rate above two percent. But metrics flatten away the structure of a request. Trace-based alerting evaluates conditions against the trace itself.

What You Can Express

  • Span-level conditions: alert when a specific downstream span exceeds a latency budget.
  • Structural conditions: alert when a trace contains an unexpected dependency or a retry storm.
  • Combined conditions: alert when a slow trace also has an exception event on a payment span.

These are precise in a way a single aggregate cannot be. A metric says latency rose; a trace-based alert says checkout is slow specifically because the inventory call is timing out.
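
The three kinds of condition above, evaluated against one trace. This is an added example, not code from the lesson: the span fields, service names, budgets and allowed call graph are all constructed assumptions.

```python
from collections import Counter

def span(id, parent, service, name, ms, events=()):
    return {"id": id, "parent": parent, "service": service,
            "name": name, "ms": ms, "events": list(events)}

# Constructed sample trace: one checkout request flattened to a list of spans.
TRACE = [
    span("a", None, "checkout", "POST /checkout", 2300),
    span("b", "a", "inventory", "reserve", 600),
    span("c", "a", "inventory", "reserve", 600),
    span("d", "a", "inventory", "reserve", 600),
    span("e", "a", "payment", "charge", 400, ["exception"]),
    span("f", "e", "legacy-ledger", "write", 50),
]

BUDGET_MS = {"inventory": 500, "payment": 800}       # hypothetical per-span budgets
ALLOWED_EDGES = {("checkout", "inventory"), ("checkout", "payment")}
RETRY_LIMIT = 2        # more identical child calls under one parent = retry storm
SLOW_TRACE_MS = 2000   # root span duration that counts as a slow trace

def evaluate(trace):
    """Return a list of (rule, detail) alerts for a single trace."""
    by_id = {s["id"]: s for s in trace}
    alerts = []
    # Span-level condition: a downstream span exceeds its latency budget.
    for s in trace:
        budget = BUDGET_MS.get(s["service"])
        if budget is not None and s["ms"] > budget:
            alerts.append(("span_over_budget", s["id"]))
    # Structural condition: a caller -> callee edge outside the expected graph.
    for s in trace:
        parent = by_id.get(s["parent"])
        if parent and parent["service"] != s["service"]:
            if (parent["service"], s["service"]) not in ALLOWED_EDGES:
                alerts.append(("unexpected_dependency", s["id"]))
    # Structural condition: the same operation repeated under one parent.
    calls = Counter((s["parent"], s["service"], s["name"])
                    for s in trace if s["parent"])
    for (_, service, name), n in calls.items():
        if n > RETRY_LIMIT:
            alerts.append(("retry_storm", f"{service}.{name} x{n}"))
    # Combined condition: slow trace AND an exception event on a payment span.
    root = next(s for s in trace if s["parent"] is None)
    if root["ms"] > SLOW_TRACE_MS and any(
            s["service"] == "payment" and "exception" in s["events"] for s in trace):
        alerts.append(("slow_trace_with_payment_exception", root["id"]))
    return alerts
```
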

The Trade-Offs

Evaluating every trace is expensive, so rules often run on the sampled or tail-selected set. There is also a risk of noisy rules that match many traces, so conditions must be specific enough to be actionable.

Key idea

Trace-based alerting evaluates rules against the structure and spans of a trace, catching precise failures that aggregate metric thresholds hide.

Check yourself

1. What can trace-based alerting express that a metric threshold cannot?

2. Why are trace-based rules often run on a sampled set?