Beyond Metric Thresholds
A classic alert fires when a metric crosses a line, like error rate above two percent. But metrics flatten away the structure of a request. Trace based alerting evaluates conditions against the trace itself.
What You Can Express
- Span level conditions: alert when a specific downstream span exceeds a latency budget.
- Structural conditions: alert when a trace contains an unexpected dependency or a retry storm.
- Combined conditions: alert when a slow trace also has an exception event on a payment span.
These are precise in a way a single aggregate cannot be. A metric says latency rose; a trace based alert says checkout is slow specifically because the inventory call is timing out.
The Trade Offs
Evaluating every trace is expensive, so rules often run on the sampled or tail selected set. There is also a risk of noisy rules that match many traces, so conditions must be specific enough to be actionable.
Key idea
Trace based alerting evaluates rules against the structure and spans of a trace, catching precise failures that aggregate metric thresholds hide.