Machine Learning

Dynamic Batching For Throughput

Group nearby requests so the GPU does more work per pass.

4 min read · intro

Why batch at all

A GPU reaches its highest throughput when it processes many inputs in one pass. Sending one request at a time leaves most of the hardware idle. Batching packs several requests together so each forward pass does more useful work.

The dynamic part

In production, requests arrive at unpredictable times. Dynamic batching waits for a short window, collects whatever requests arrived during it, then runs them as one batch. The server forms batches on the fly instead of requiring callers to group inputs themselves.

The core tradeoff

  • A longer wait window builds bigger batches and raises throughput.
  • A longer wait also adds latency, most of all for the first request in the batch, which waits the longest.
  • A short window keeps latency low but forms small batches that leave GPU capacity unused.
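
To make the tradeoff concrete, here is a minimal simulation sketch. It assumes a simplified model that is not part of the lesson: the window opens when the first request of a batch arrives, the batch runs the moment the window closes, and GPU compute time is ignored. The function name `simulate` and the arrival times are made up for illustration.

```python
def simulate(arrivals_ms, window_ms):
    """Group sorted arrival times (ms) into batches under a fixed wait window.

    Assumed model: a window opens at the first arrival of each batch and the
    batch runs when the window closes. Compute time is ignored.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        # A request arriving once the window has closed starts a new batch.
        if current and t - current[0] >= window_ms:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)

    # Each request waits from its own arrival until its batch's window closes.
    waits = [b[0] + window_ms - t for b in batches for t in b]
    return {
        "batches": len(batches),
        "mean_batch_size": len(arrivals_ms) / len(batches),
        "max_wait_ms": max(waits),
    }


arrivals = [0, 1, 2, 6, 7, 12, 13, 14, 20, 21]  # hypothetical arrival times in ms
for window in (2, 5, 10):
    print(window, simulate(arrivals, window))
```

With these made-up arrivals, the 2 ms window forms 6 small batches, the 5 ms window forms 4, and the 10 ms window forms 2 batches of 5 requests each. In every case the longest wait equals the window, because the first request in a batch waits for the whole of it.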

Tuning the window

Operators set a maximum wait time and a maximum batch size. The server fires a batch as soon as either limit is hit, balancing how busy the GPU stays against how long any request waits.
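
The "either limit" rule can be sketched with Python's standard `queue` module. This is an illustrative sketch, not how any particular serving framework implements it. The names `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical, and the wait window is assumed to start when the first request of the batch arrives.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Return a batch as soon as the size limit or the wait limit is hit."""
    # Block until at least one request exists; an empty batch is never run.
    batch = [requests.get()]
    # Assumption: the wait window starts at the first request's arrival.
    deadline = time.monotonic() + max_wait_s

    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait limit hit
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # window closed with no further arrivals
    return batch


q = queue.Queue()
for i in range(5):
    q.put(i)

print(collect_batch(q, max_batch_size=3, max_wait_s=0.05))  # [0, 1, 2]: size limit hit first
print(collect_batch(q, max_batch_size=3, max_wait_s=0.05))  # [3, 4]: wait limit hit first
```

A serving loop would call `collect_batch` repeatedly and pass each returned batch to the model. Raising `max_wait_s` or `max_batch_size` pushes toward throughput, and lowering them pushes toward latency.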

Key idea

Dynamic batching trades a small wait for far higher throughput by filling each GPU pass. The wait-window and batch-size knobs set where you land on the latency-versus-throughput curve.

Check yourself

1. What does dynamic batching trade to gain throughput?

2. When does the server fire a dynamic batch?