Monitoring Inference Latency And Cost

Watch tail latency and spend so serving stays healthy.

Why monitor serving

A model can be accurate yet fail in production if it is slow or expensive. Monitoring tracks how the service behaves under real traffic so problems surface before users feel them.

Latency beyond the average

  • The average hides slow requests: a few multi-second responses barely move the mean, so it is a weak signal.
  • Percentiles such as the 95th and 99th (p95 and p99) expose the slow tail that users actually notice.
  • For token streaming, track time to first token and time between tokens separately, since the two can degrade independently.
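To make the gap between the average and the tail concrete, here is a minimal sketch in Python. It assumes latencies are already collected as a plain list of milliseconds and that each streamed token has a timestamp. The function names `percentile` and `streaming_metrics`, the nearest-rank percentile method, and all sample numbers are illustrative choices, not taken from any particular monitoring tool.

```python
import math
from statistics import mean


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def streaming_metrics(request_start, token_times):
    """Return (time to first token, mean gap between consecutive tokens), in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, (mean(gaps) if gaps else 0.0)


# Hypothetical traffic: 90 fast requests, 8 slow ones, 2 very slow ones.
latencies_ms = [100] * 90 + [900] * 8 + [3000] * 2

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"mean={avg} p50={p50} p95={p95} p99={p99}")

# Hypothetical streamed response: timestamps in seconds since the request was sent.
ttft, inter_token = streaming_metrics(0.0, [0.40, 0.45, 0.50, 0.55])
print(f"ttft={ttft:.2f}s inter_token={inter_token:.3f}s")
```

In this made-up sample the mean sits at 222 ms, while p95 is 900 ms and p99 is 3000 ms, so a dashboard showing only the mean would miss the requests users complain about. A production system would normally compute percentiles from histograms or a metrics backend rather than sorting raw samples.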

Cost signals

  • Cost per request and cost per token tie spend to usage.
  • GPU utilization shows whether you are paying for idle hardware.
  • A rising cost with flat traffic signals waste or a regression.
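The cost signals above reduce to a few ratios. The sketch below assumes spend comes entirely from GPU hours billed at a flat hourly price, which is a simplification. The `WindowStats` fields, the `cost_signals` function, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Totals for one monitoring window (all fields are hypothetical)."""
    gpu_hours: float           # GPU hours billed in the window
    price_per_gpu_hour: float  # assumed flat price, in dollars
    requests: int              # requests served
    tokens: int                # tokens generated
    gpu_busy_hours: float      # GPU hours spent doing work


def cost_signals(w):
    """Derive cost per request, cost per 1,000 tokens, and GPU utilization."""
    if w.requests <= 0 or w.tokens <= 0 or w.gpu_hours <= 0:
        raise ValueError("window must contain traffic and billed GPU time")
    spend = w.gpu_hours * w.price_per_gpu_hour
    return {
        "cost_per_request": spend / w.requests,
        "cost_per_1k_tokens": spend / w.tokens * 1000,
        "gpu_utilization": w.gpu_busy_hours / w.gpu_hours,
    }


# Same traffic in both weeks, but more GPU hours billed in the second.
last_week = WindowStats(100, 2.0, 50_000, 10_000_000, 60)
this_week = WindowStats(160, 2.0, 50_000, 10_000_000, 60)

before = cost_signals(last_week)
after = cost_signals(this_week)
print(before)
print(after)
```

With traffic flat, cost per 1,000 tokens rises from about $0.020 to about $0.032 while utilization falls from 0.6 to 0.375, which is the "rising cost with flat traffic" pattern pointing at idle hardware.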

Turning signals into action

Teams set targets and alerts on tail latency and cost. When the 99th percentile crosses its target, autoscaling or batch tuning kicks in. When cost per token drifts up, they look for low utilization or oversized instances.
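The mapping from signals to actions can be sketched as a small rule function. The thresholds, the function name `recommend_actions`, and the action strings are placeholders chosen for illustration. Real targets depend on the service, and a real setup would express these rules in its alerting system rather than in application code.

```python
def recommend_actions(p99_ms, cost_per_1k_tokens, gpu_utilization,
                      p99_target_ms=500.0, cost_target=0.03, min_utilization=0.5):
    """Return suggested follow-ups for one monitoring window.

    All default thresholds are hypothetical placeholders.
    """
    actions = []
    if p99_ms > p99_target_ms:
        # Tail latency is over target: add capacity or adjust batching.
        actions.append("scale out or retune batching")
    if cost_per_1k_tokens > cost_target:
        if gpu_utilization < min_utilization:
            # Paying for hardware that sits idle.
            actions.append("check for idle or oversized instances")
        else:
            # Hardware is busy, so look for a regression instead.
            actions.append("investigate cost regression")
    return actions


healthy = recommend_actions(p99_ms=320, cost_per_1k_tokens=0.02, gpu_utilization=0.6)
degraded = recommend_actions(p99_ms=900, cost_per_1k_tokens=0.032, gpu_utilization=0.375)
print(healthy)
print(degraded)
```

The healthy window produces no actions, while the degraded one flags both the slow tail and the low-utilization cost drift. Keeping latency and cost as separate rules means one alert never masks the other.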

Key idea

Effective inference monitoring watches tail latency percentiles and cost per token, not just averages, and ties them to alerts. Catching a slow tail or rising cost early lets teams tune batching, scaling, and instance size before users or budgets suffer.

Check yourself

1. Why are latency percentiles better than the average?

2. What does low GPU utilization with steady traffic suggest?