Filling the machine
A GPU only reaches its potential when given enough parallel work. Batch size is the most direct lever: processing many samples at once gives the scheduler enough warps to keep every SM busy.
Too small versus too large
- A tiny batch leaves cores idle: with too few warps in flight, kernel launch overhead and memory latency cannot be hidden behind other work.
- A large batch raises utilization and throughput, but it uses more memory, increases per-request latency in inference, and in training can hurt generalization unless hyperparameters are retuned.
There is a point of diminishing returns where utilization saturates and bigger batches only add memory pressure.
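One way to locate that point in practice is to benchmark a few batch sizes and look for where the marginal gain collapses. The sketch below is only an illustration: the function name `saturation_batch`, the 5% threshold, and the measurements are hypothetical, not taken from any real device.

```python
def saturation_batch(measurements, min_gain=0.05):
    """Return the batch size after which throughput stops growing meaningfully.

    measurements: list of (batch_size, samples_per_second) pairs.
    min_gain: relative throughput gain below which the next step up
              is treated as not worth the extra memory (assumed threshold).
    """
    ordered = sorted(measurements)
    for (b0, t0), (b1, t1) in zip(ordered, ordered[1:]):
        if (t1 - t0) / t0 < min_gain:
            return b0
    # Throughput was still climbing at the largest size measured.
    return ordered[-1][0]


# Hypothetical benchmark results: (batch size, samples per second).
measured = [
    (1, 900.0), (2, 1750.0), (4, 3300.0), (8, 5600.0),
    (16, 7400.0), (32, 7650.0), (64, 7700.0),
]
knee = saturation_batch(measured)
print("throughput saturates around batch size", knee)
```

With these made-up numbers, going from 16 to 32 gains only about 3%, so 16 is reported as the knee. Beyond it, larger batches mostly add memory pressure.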
The throughput curve
Throughput rises with batch size, then flattens once the GPU is saturated.
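The shape of this curve can be reproduced with a deliberately simple model: assume each step costs a fixed overhead plus a per-sample cost at the device's peak rate. Both constants below are hypothetical, and real hardware will not follow this formula exactly; the point is only the rise-then-flatten behaviour.

```python
# Assumed, illustrative constants (not measurements of any real GPU).
OVERHEAD_S = 0.002      # fixed cost per step: launch, sync, data movement
PEAK_RATE = 10_000.0    # samples per second once the device is saturated


def step_time(batch_size):
    """Modeled time for one step: fixed overhead plus per-sample work."""
    return OVERHEAD_S + batch_size / PEAK_RATE


def throughput(batch_size):
    """Modeled samples per second at a given batch size."""
    return batch_size / step_time(batch_size)


for b in (1, 4, 16, 64, 256, 1024):
    share = throughput(b) / PEAK_RATE
    print(f"batch {b:5d}: {throughput(b):8.0f} samples/s ({share:.0%} of peak)")
```

At small batch sizes the fixed overhead dominates, so throughput grows almost linearly with the batch. As the batch grows, the per-sample term dominates and throughput approaches `PEAK_RATE` without exceeding it.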
Choosing a batch size
In training, larger batches improve hardware efficiency but may require retuning the learning rate. In inference, batching boosts throughput at the cost of per-request latency, so serving systems must balance the two.
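For the inference case, the balance can be framed as picking the highest-throughput batch size that still meets a latency budget. The following is a minimal sketch under that assumption; `pick_batch_size`, the latency table, and the 10 ms budget are hypothetical placeholders for values you would measure on your own system.

```python
def pick_batch_size(latency_ms_by_batch, budget_ms):
    """Return the batch size with the best throughput within the latency budget.

    latency_ms_by_batch: dict mapping batch size -> measured latency in ms.
    budget_ms: maximum acceptable latency for a batch.
    Returns None when no measured batch size fits the budget.
    """
    within_budget = {
        b: lat for b, lat in latency_ms_by_batch.items() if lat <= budget_ms
    }
    if not within_budget:
        return None
    # Throughput is samples per millisecond: batch size divided by latency.
    return max(within_budget, key=lambda b: b / within_budget[b])


# Hypothetical measurements: batch size -> latency in milliseconds.
latency_ms = {1: 4.0, 2: 4.5, 4: 5.5, 8: 8.0, 16: 14.0, 32: 27.0}
chosen = pick_batch_size(latency_ms, budget_ms=10.0)
print("batch size under a 10 ms budget:", chosen)
```

A tighter budget forces a smaller batch and lower throughput, while a looser one lets the system move further up the curve. If the budget is below the latency of even a single sample, the function returns `None`.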
Key idea
Batch size controls how fully the GPU is used: larger batches raise utilization and throughput up to saturation, trading memory and latency for efficiency.