Batch Size and GPU Utilization

Filling the machine

A GPU reaches its peak throughput only when it has enough parallel work. Batch size is the most direct lever: processing many samples at once gives the scheduler enough warps to keep every streaming multiprocessor (SM) busy.

Too small versus too large

A tiny batch leaves most cores idle: with so little work in flight, kernel launch overhead and memory latency cannot be hidden behind other computation.
A large batch raises utilization and throughput, but it uses more memory, and it can hurt model accuracy in training or per-request latency in inference.

There is a point of diminishing returns where utilization saturates and bigger batches only add memory pressure.
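The memory side of this trade-off can be sketched with a rough estimate. The snippet below assumes that memory use splits into a fixed part (weights and similar state) and a part that grows linearly with batch size. That is a simplification, and the function name and all numbers are hypothetical, not measurements.

```python
def max_batch_that_fits(device_mem_mib, fixed_mem_mib, per_sample_mem_mib):
    """Largest batch size that fits in device memory under a linear memory model.

    Assumption: total memory = fixed_mem_mib + batch_size * per_sample_mem_mib.
    Real frameworks add allocator overhead and workspace buffers on top.
    """
    if per_sample_mem_mib <= 0:
        raise ValueError("per_sample_mem_mib must be positive")
    free_mib = device_mem_mib - fixed_mem_mib
    if free_mib <= 0:
        return 0  # the fixed part alone does not fit
    return int(free_mib // per_sample_mem_mib)


# Hypothetical device: 16 GiB total, 6000 MiB fixed, 40 MiB per sample.
print(max_batch_that_fits(16384, 6000, 40))
```

In practice the memory ceiling found this way is only an upper bound. The useful batch size is often smaller, because utilization may already be saturated well below it.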

The throughput curve

Throughput rises with batch size, then flattens once the GPU is saturated.
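A toy model makes the shape of this curve concrete. The sketch below assumes each batch pays a fixed launch overhead and is then processed in "waves" of a fixed number of samples. The parameters (`fixed_overhead_ms`, `wave_ms`, `samples_per_wave`) are invented for illustration and do not describe any particular GPU.

```python
import math


def batch_time_ms(batch_size, fixed_overhead_ms=0.5, wave_ms=2.0, samples_per_wave=64):
    """Toy batch time: a fixed overhead plus one wave per samples_per_wave samples."""
    waves = math.ceil(batch_size / samples_per_wave)
    return fixed_overhead_ms + waves * wave_ms


def throughput(batch_size, **model_params):
    """Samples per second under the toy model above."""
    return batch_size / (batch_time_ms(batch_size, **model_params) / 1000.0)


for b in (1, 8, 64, 512, 4096):
    print(f"batch={b:5d}  throughput={throughput(b):10.0f} samples/s")
```

With these made-up parameters, throughput climbs steeply while the batch still fits in a single wave, then approaches the ceiling of `samples_per_wave / wave_ms` samples per millisecond as the fixed overhead is amortized.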

Choosing a batch size

In training, larger batches improve hardware efficiency but may require learning-rate tuning. In inference, batching boosts throughput at the cost of per-request latency, so serving systems must balance the two carefully.
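For the inference case, one simple way to strike that balance is to measure latency at several batch sizes and keep the highest-throughput option that stays within a latency budget. The sketch below assumes such measurements already exist. The function name `pick_batch_size` and the numbers in `measured` are hypothetical.

```python
def pick_batch_size(latency_ms_by_batch, latency_budget_ms):
    """Return the batch size with the highest throughput whose latency fits the budget.

    latency_ms_by_batch maps batch size -> measured latency of one batch in ms.
    Returns None if no batch size meets the budget.
    """
    best_batch = None
    best_throughput = 0.0
    for batch_size, latency_ms in latency_ms_by_batch.items():
        if latency_ms > latency_budget_ms:
            continue  # too slow for a single request to wait on
        samples_per_ms = batch_size / latency_ms
        if samples_per_ms > best_throughput:
            best_batch, best_throughput = batch_size, samples_per_ms
    return best_batch


# Hypothetical measurements: batch size -> latency in milliseconds.
measured = {1: 2.5, 8: 2.6, 32: 3.4, 128: 9.0, 512: 33.0}
print(pick_batch_size(measured, latency_budget_ms=10.0))
```

A tighter budget pushes the choice toward smaller batches and lower throughput, which is the trade-off described above.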

Key idea

Batch size controls how fully the GPU is used: larger batches raise utilization and throughput up to saturation, trading memory and latency for efficiency.
