Machine Learning

Autoscaling Inference Services

Add and remove serving instances as demand rises and falls.

Why scale automatically

Traffic to a model service rises and falls through the day. Autoscaling adds instances when demand climbs and removes them when it falls, so you pay for capacity only when you need it.

What to scale on

  • Queue length, or the number of pending requests, directly reflects load pressure.
  • Latency rising past a target signals overload.
  • GPU utilization shows whether hardware is saturated.
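One way to combine these signals is a proportional rule: compare each signal to its target and scale by whichever is furthest over. The sketch below is illustrative only. The function name, parameters, and target values are assumptions, not part of any specific autoscaler, and treating latency as proportional to load is a simplification.

```python
import math


def desired_replicas(current, queue_per_replica, p95_latency_ms, gpu_util,
                     target_queue=4.0, target_latency_ms=200.0,
                     target_gpu_util=0.7):
    """Suggest a replica count from the most pressured of three signals.

    All targets are hypothetical defaults chosen for illustration.
    """
    # Ratio > 1 means the signal is over target; < 1 means there is headroom.
    ratios = [
        queue_per_replica / target_queue,
        p95_latency_ms / target_latency_ms,
        gpu_util / target_gpu_util,
    ]
    # Scale by the worst signal so one overloaded metric is enough to add capacity.
    return max(1, math.ceil(current * max(ratios)))


# Queue is at twice its target, so 4 replicas become 8.
print(desired_replicas(4, queue_per_replica=8.0, p95_latency_ms=150.0, gpu_util=0.35))
```

Taking the maximum ratio means the service scales down only when every signal has headroom.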

The GPU twist

Inference autoscaling is harder than typical web scaling because a new GPU instance faces a cold start: it must load the model weights before it can serve requests. By the time the instance is ready, the spike may have grown further. Scaling must therefore react early and account for warmup time.
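Accounting for warmup can be sketched as sizing capacity for the load expected when new instances become ready, not the load right now. This minimal example assumes a simple linear extrapolation of recent traffic growth. The function name, the one-minute window, and the per-replica throughput figure are all hypothetical.

```python
import math


def replicas_for_projected_load(rps_now, rps_one_minute_ago, warmup_seconds,
                                rps_per_replica):
    """Size the fleet for the load expected once new replicas finish warming up."""
    # Recent growth in requests per second, per second (linear assumption).
    growth_per_second = (rps_now - rps_one_minute_ago) / 60.0
    # Only look ahead when traffic is rising; falling traffic is sized as-is.
    projected_rps = rps_now + max(0.0, growth_per_second) * warmup_seconds
    return math.ceil(projected_rps / rps_per_replica)


# Traffic grew from 70 to 100 rps in a minute and warmup takes 120 s.
# Current load alone needs 5 replicas; the projected 160 rps needs 8.
print(replicas_for_projected_load(100, 70, warmup_seconds=120, rps_per_replica=20))
```

The longer the warmup, the further ahead the projection reaches, which is why slow-loading models need earlier scaling triggers.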

Scaling to zero

For services with infrequent traffic, you can scale to zero instances and pay nothing while idle, accepting a cold start on the next request. Frequently used services keep a minimum number of warm instances to avoid that penalty.
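The choice between scale to zero and a warm minimum can be expressed as a single floor on the replica count. The sketch below is an assumption-laden illustration: the names `min_warm` and `idle_timeout`, and the five-minute default, are hypothetical and not taken from any particular platform.

```python
def target_replicas(in_flight, idle_seconds, current, min_warm=0,
                    idle_timeout=300.0):
    """Decide how many replicas to keep, given activity and a warm floor.

    min_warm=0 allows scale to zero; min_warm>=1 keeps instances warm.
    """
    if in_flight > 0:
        # Work is waiting: keep what we have, and start at least one replica.
        # If current is 0, this request pays the cold start.
        return max(current, 1, min_warm)
    if idle_seconds >= idle_timeout:
        # Idle long enough: drop to the floor (zero if scale to zero is allowed).
        return min_warm
    # Recently active: hold steady, but never below the floor.
    return max(current, min_warm)


print(target_replicas(in_flight=0, idle_seconds=600, current=2))              # scales to zero
print(target_replicas(in_flight=0, idle_seconds=600, current=2, min_warm=1))  # keeps one warm
```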

Key idea

Autoscaling matches serving capacity to demand using signals like queue length, latency, and GPU utilization. GPU cold starts make new capacity slower to arrive than in web autoscaling, so teams scale early and keep a warm minimum unless they are willing to accept the cold-start penalty of scaling to zero.

Check yourself

1. Why is autoscaling inference harder than autoscaling web servers?

2. What is the cost of scaling an inference service to zero?