Latency vs Throughput

Why making one request fast and making many requests cheap pull a system in different directions.

Two different questions

Latency asks how long a single request takes from start to finish. Throughput asks how many requests the system finishes per second. They sound similar but they optimize for different things.

A system can have low latency yet low throughput if each request is fast but only one runs at a time. It can have high throughput yet high latency if it batches work so each item waits before processing.
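
To make the two cases concrete, here is a minimal sketch with made-up numbers. The function names, the 10 ms service time, and the 10,000-item batch are illustrative assumptions, not measurements of any real system.

```python
def serial_system(service_time_s):
    """One request at a time: latency is the service time, throughput its inverse."""
    latency_s = service_time_s
    throughput_per_s = 1.0 / service_time_s
    return latency_s, throughput_per_s


def batched_system(batch_size, batch_time_s):
    """Items finish when their batch finishes (time spent filling the batch is ignored)."""
    latency_s = batch_time_s
    throughput_per_s = batch_size / batch_time_s
    return latency_s, throughput_per_s


if __name__ == "__main__":
    # Hypothetical figures chosen only to show the contrast.
    cases = [
        ("serial", serial_system(0.010)),
        ("batched", batched_system(10_000, 2.0)),
    ]
    for name, (latency_s, throughput) in cases:
        print(f"{name}: latency {latency_s * 1000:.0f} ms, throughput {throughput:.0f} req/s")
```

Under these assumptions the serial system answers each request quickly but completes far fewer per second, while the batched system completes many more per second but makes every item wait for the whole batch.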

The tension

Batching raises throughput because fixed costs are shared across many items, but it raises latency because items wait for a batch to fill.
Parallelism raises throughput by using more cores or machines, but coordination can add latency.
A queue absorbs bursts and keeps throughput steady, yet items in the queue experience longer waits.
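
The batching trade-off can be sketched with a simple cost model. Everything here is an assumption made for illustration: a fixed cost per batch, a constant cost per item, evenly spaced arrivals, and no queueing behind earlier batches. Real systems will deviate from it.

```python
def batch_tradeoff(batch_size, fixed_cost_s, per_item_cost_s, arrival_rate_per_s):
    """Return (capacity in items/s, mean latency in s) for a given batch size."""
    processing_s = fixed_cost_s + batch_size * per_item_cost_s
    # The fixed cost is shared by every item in the batch, so capacity grows with size.
    capacity = batch_size / processing_s
    # The first item waits for batch_size - 1 more arrivals and the last waits for none,
    # so the mean fill wait is half the time it takes to collect the batch.
    mean_fill_wait_s = (batch_size - 1) / (2 * arrival_rate_per_s)
    return capacity, mean_fill_wait_s + processing_s


if __name__ == "__main__":
    # Hypothetical parameters: 5 ms per batch, 0.1 ms per item, 1000 arrivals per second.
    for size in (1, 10, 100):
        capacity, latency_s = batch_tradeoff(size, 0.005, 0.0001, 1000)
        print(f"batch={size:>3}  capacity={capacity:8.0f}/s  mean latency={latency_s * 1000:6.1f} ms")
```

In this model, larger batches raise capacity and mean latency together, which is the tension described above.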

What to optimize

A trading system or a keystroke handler cares most about latency.
A nightly report or a log pipeline cares most about throughput.
Most user-facing systems set a latency budget at the tail, such as the 99th percentile, then maximize throughput within it.
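
One way to apply a tail-latency budget is to run load tests at several concurrency levels and keep the configuration with the highest throughput whose 99th percentile still fits. The sketch below assumes such results are already available. The table is invented for illustration, and `best_within_budget` is a hypothetical helper.

```python
# Invented load-test results: (concurrency, throughput in req/s, p99 latency in ms).
RESULTS = [
    (1, 180, 12),
    (4, 650, 18),
    (16, 2100, 45),
    (64, 3900, 160),
    (256, 4200, 900),
]


def best_within_budget(results, p99_budget_ms):
    """Return the row with the highest throughput whose p99 is within budget, or None."""
    eligible = [row for row in results if row[2] <= p99_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row[1])


if __name__ == "__main__":
    print(best_within_budget(RESULTS, p99_budget_ms=100))
```

With a 100 ms budget, the two highest-throughput rows are rejected because their tail latency exceeds the budget, even though they complete more requests per second.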

Measure both. A single average hides the trade-off you are actually making.
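
A small sketch of why an average can mislead. The sample below is synthetic (98 fast requests and 2 slow ones), and the percentile uses the nearest-rank definition. Other definitions interpolate and can give slightly different values.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 98 requests at 10 ms and 2 at 1000 ms.
latencies_ms = [10] * 98 + [1000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

if __name__ == "__main__":
    print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean lands between the typical request and the slow ones and describes neither, while the median and the 99th percentile report them separately.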

Key idea

Latency is the speed of one request and throughput is the volume of many, and tuning for one often costs the other.
