Machine Learning

Inference Batching and Throughput

Group requests to raise GPU utilization while balancing latency and throughput.

Why batch at inference

A GPU is most efficient when it processes many inputs at once, so serving one request at a time leaves the device underused. Batching groups requests so that each kernel launch does more work, which raises throughput (the number of requests served per second).
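
To make the throughput gain concrete, here is a toy cost model in Python. It assumes each batch pays a fixed overhead plus a small per-request cost; the function name and the numbers `fixed_ms` and `per_item_ms` are illustrative assumptions, not measurements from any real GPU.

```python
def throughput(batch_size: int, fixed_ms: float = 5.0, per_item_ms: float = 0.5) -> float:
    """Requests per second under a toy model (assumed, not measured):
    one batch costs fixed_ms of overhead plus per_item_ms per request."""
    batch_ms = fixed_ms + per_item_ms * batch_size
    return batch_size / batch_ms * 1000.0


# The fixed overhead is shared by every request in the batch,
# so throughput rises as the batch grows.
for size in (1, 8, 32):
    print(size, round(throughput(size), 1))
```

Under these assumed costs, a batch of 8 serves several times more requests per second than a batch of 1, because the fixed overhead is paid once per batch rather than once per request.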

The latency tradeoff

Batching helps throughput but can hurt latency for an individual request, which may wait for the batch to fill. The serving system must balance the two.

  • A larger maximum batch size raises throughput but can add waiting time.
  • A short batch timeout caps how long a request waits before the batch flushes, protecting tail latency.
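
The two knobs above can be sketched as a small simulation. This is a minimal illustration, assuming the server can always start a batch the moment it flushes; `form_batches` and `max_wait` are hypothetical helper names, not part of any serving framework.

```python
from typing import List, Tuple

Batch = Tuple[float, List[float]]  # (flush time, arrival times of its requests)


def form_batches(arrivals_ms: List[float], max_batch_size: int, timeout_ms: float) -> List[Batch]:
    """Group requests by arrival time. A batch flushes when it is full,
    or timeout_ms after its first request arrived, whichever comes first."""
    batches: List[Batch] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        # The open batch timed out before this request arrived: flush it.
        if current and t > current[0] + timeout_ms:
            batches.append((current[0] + timeout_ms, current))
            current = []
        current.append(t)
        # The batch is full: flush immediately.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + timeout_ms, current))
    return batches


def max_wait(batches: List[Batch]) -> float:
    """Longest time any request spent waiting for its batch to flush."""
    return max(flush - t for flush, requests in batches for t in requests)


batches = form_batches([0, 1, 2, 30, 31, 100], max_batch_size=4, timeout_ms=10)
print(batches)            # three batches, flushed at 10, 40 and 110 ms
print(max_wait(batches))  # 10: no request waits longer than the timeout
```

In this sketch the timeout bounds the worst-case queueing delay no matter how sparse the traffic is, while the maximum batch size only matters when requests arrive quickly enough to fill a batch before the timeout.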

Dynamic and continuous batching

  • Dynamic batching collects requests that arrive within a small window and runs them together, adapting to live traffic.
  • Continuous batching, used for text generation, lets new requests join a running batch as soon as other requests finish generating, instead of waiting for the whole batch to complete. This keeps the GPU busy and is far more efficient when output lengths vary.
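
The difference can be illustrated by counting decoding steps in a simplified simulation. It assumes every step produces one token for each active request, that output lengths are known and at least 1, and it ignores prompt processing and memory limits; `static_steps` and `continuous_steps` are hypothetical names used only for this sketch.

```python
from collections import deque
from typing import Deque, List


def static_steps(lengths: List[int], slots: int) -> int:
    """Whole-batch scheduling: each group of `slots` requests runs until
    its longest output finishes, so short requests leave slots idle."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths: List[int], slots: int) -> int:
    """Continuous batching: a waiting request takes over a slot
    as soon as the request that held it has finished."""
    waiting: Deque[int] = deque(lengths)
    active: List[int] = []  # tokens still to generate per running request
    steps = 0
    while waiting or active:
        # Refill free slots before each decoding step.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; finished requests drop out.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [2, 8, 2, 8, 2, 8]  # output lengths in tokens (made-up workload)
print(static_steps(lengths, slots=2))      # 24
print(continuous_steps(lengths, slots=2))  # 18
```

In this made-up workload the same 30 tokens take 24 steps with whole-batch scheduling and 18 with continuous batching, because freed slots are reused instead of idling until the longest request in the group finishes.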

Key idea

Batching raises GPU throughput at some cost to per-request latency; batch size and timeout tune that balance, while continuous batching keeps generation servers efficient under variable-length workloads.

Check yourself

1. What is the main tradeoff when increasing inference batch size?

2. Why is continuous batching efficient for text generation?

3. What does a batch timeout protect?