Sizing the Workforce
Throughput equals the number of workers multiplied by the jobs each worker can finish per second. When jobs arrive faster than workers drain them, the queue grows and latency climbs. The fix is more capacity, which can be added in two ways.
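As a rough illustration of the arithmetic above, the sketch below computes throughput and the smallest worker count that keeps up with a given arrival rate. The function names and the example rates are assumptions made for illustration, and the model ignores variance in job duration.

```python
import math


def throughput(workers: int, jobs_per_worker_per_sec: float) -> float:
    """Jobs per second the whole pool can finish."""
    return workers * jobs_per_worker_per_sec


def workers_needed(arrival_rate: float, jobs_per_worker_per_sec: float) -> int:
    """Smallest worker count whose throughput is at least the arrival rate."""
    if jobs_per_worker_per_sec <= 0:
        raise ValueError("each worker must finish more than 0 jobs per second")
    return math.ceil(arrival_rate / jobs_per_worker_per_sec)
```

For example, if jobs arrive at 50 per second and each worker finishes 5 per second, `workers_needed(50, 5.0)` gives 10. With fewer workers the backlog grows for as long as that arrival rate holds.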
Concurrency and Replicas
- Concurrency runs several jobs inside one worker process. It is ideal when work spends most of its time waiting on input and output (I/O).
- Replicas add more worker processes or machines. They are needed when work is processor-bound.
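For tasks that spend most of their time waiting on network or disk, raising concurrency is cheap: a waiting job uses almost no processor time, so one worker can keep many jobs in flight. For heavy computation, add replicas, because a core can run only one busy job at a time.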
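A minimal sketch of concurrency inside one worker, assuming Python's `asyncio` and using `asyncio.sleep` as a stand-in for a network or disk wait. The function name `run_io_jobs` and its parameters are hypothetical. It returns the peak number of jobs in flight, so the effect of the concurrency limit is visible.

```python
import asyncio


async def run_io_jobs(n_jobs: int, concurrency: int, wait_s: float = 0.01) -> int:
    """Run n_jobs simulated I/O jobs in one process; return peak jobs in flight."""
    limit = asyncio.Semaphore(concurrency)  # caps jobs running at once
    in_flight = 0
    peak = 0

    async def job() -> None:
        nonlocal in_flight, peak
        async with limit:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(wait_s)  # stand-in for waiting on network or disk
            in_flight -= 1

    await asyncio.gather(*(job() for _ in range(n_jobs)))
    return peak
```

Calling `asyncio.run(run_io_jobs(10, 4))` keeps at most four jobs waiting at once in a single process. This approach does not help processor-bound work, since the jobs would compete for the same core; for that, separate processes or machines are the usual route.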
Autoscaling Signals
- Queue depth is the clearest signal. Scale up when backlog rises.
- Oldest job age captures latency directly and is harder to mislead: a short queue of slow jobs still shows up as rising age.
- Processor or memory use suits compute-bound pools.
Scale up quickly to clear backlog, but scale down slowly to avoid flapping, where workers are repeatedly added and removed as the signal oscillates.
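One way to combine these signals is sketched below. The class name, thresholds, and cooldown are all assumed values chosen for illustration, not recommendations. The scaler jumps straight to the size the backlog calls for, but removes only one worker at a time and only after a quiet period.

```python
import math
from dataclasses import dataclass


@dataclass
class Autoscaler:
    # All defaults are illustrative assumptions.
    min_workers: int = 1
    max_workers: int = 20
    target_backlog_per_worker: int = 10
    max_job_age_s: float = 30.0
    scale_down_cooldown_s: float = 300.0
    _last_change_at: float = float("-inf")

    def desired(self, current: int, queue_depth: int,
                oldest_job_age_s: float, now: float) -> int:
        """Return the worker count to run, given the current signals."""
        wanted = math.ceil(queue_depth / self.target_backlog_per_worker)
        if oldest_job_age_s > self.max_job_age_s:
            # Jobs are waiting too long even if the queue looks short.
            wanted = max(wanted, current + 1)
        wanted = max(self.min_workers, min(self.max_workers, wanted))

        if wanted > current:
            # Scale up fast: go straight to the wanted size.
            self._last_change_at = now
            return wanted
        if wanted < current and now - self._last_change_at >= self.scale_down_cooldown_s:
            # Scale down slowly: one worker per cooldown period.
            self._last_change_at = now
            return current - 1
        return current
```

With these defaults, a backlog of 100 jobs takes a pool of 2 workers to 10 in one step, while an empty queue shrinks the pool by one worker every 300 seconds.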
Backpressure
If workers cannot keep up and the queue grows without bound, you risk memory exhaustion. Apply backpressure by rejecting or slowing producers, or by shedding low-value jobs.
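The three tactics can be sketched with a bounded in-process queue from Python's standard `queue` module. The `submit` helper, the `low_value` flag, and the timeout value are hypothetical names and numbers chosen for illustration.

```python
import queue


def submit(q: queue.Queue, job: object, *, low_value: bool = False,
           timeout_s: float = 0.05) -> bool:
    """Try to enqueue a job; return False if it was shed or rejected."""
    if low_value:
        try:
            q.put_nowait(job)  # never wait for low-value work
            return True
        except queue.Full:
            return False  # shed: drop the job when the queue is full
    try:
        q.put(job, timeout=timeout_s)  # slow the producer: block briefly for space
        return True
    except queue.Full:
        return False  # reject: the caller should retry later or report failure
```

Creating the queue with `queue.Queue(maxsize=1000)` is what bounds memory; the bound of 1000 is an arbitrary example. A producer that receives `False` can back off, retry, or surface the failure to its own caller.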
Key idea
Scale workers by concurrency for I/O-bound jobs and by replicas for compute-bound jobs, driven by queue depth or job age.