When one GPU is not enough
Some models are too large to fit in a single GPU's memory, or too slow when served from one device. Multi-GPU inference spreads the work across devices, but the right strategy depends on whether memory or latency is the constraint.
Forms of parallelism
- Tensor parallelism splits each layer across GPUs, with every device holding a slice of the weights. It needs a fast interconnect because the GPUs exchange activations within every layer.
- Pipeline parallelism places different layers on different GPUs and streams batches through like an assembly line, so each device handles one stage.
- Data parallelism replicates the whole model on each GPU and sends different requests to each, scaling throughput when the model already fits.
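The three forms above can be sketched on a toy scale with NumPy arrays standing in for GPUs. This is only an illustration under stated assumptions: the shapes, the two-way split, and all variable names are made up for the example, and no real device placement or communication happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # 4 requests, hidden size 8 (illustrative shapes)
W = rng.standard_normal((8, 6))   # weights of one layer
W2 = rng.standard_normal((6, 5))  # weights of a second layer

# Tensor parallelism: each "device" holds a column slice of the same layer.
W_shards = np.split(W, 2, axis=1)                  # two shards of shape (8, 3)
partial_outputs = [x @ shard for shard in W_shards]
# Gathering the partial outputs stands in for the per-layer exchange.
y_tensor_parallel = np.concatenate(partial_outputs, axis=1)

# Pipeline parallelism: each "device" holds whole layers (one stage each)
# and hands its activations to the next stage.
stages = [lambda a: a @ W, lambda a: a @ W2]
y_pipeline = x
for stage in stages:
    y_pipeline = stage(y_pipeline)

# Data parallelism: every replica holds all of W and serves different requests.
request_shards = np.split(x, 2, axis=0)            # two groups of 2 requests
y_data_parallel = np.concatenate([r @ W for r in request_shards], axis=0)
```

In all three cases the result matches what a single device would compute; what differs is which arrays each device must hold and what has to move between devices.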
Choosing a layout
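One way to read the memory-versus-latency distinction is as a small decision rule. The sketch below is a rule of thumb under stated assumptions, not a sizing tool: the function name `choose_layout`, its parameters, and the comparison of weight size against per-GPU memory are all hypothetical, and it ignores activation and cache memory as well as interconnect speed. It also assumes that tensor parallelism is the option that shortens a single request, since every GPU works on each layer at once.

```python
def choose_layout(model_gb: float, gpu_gb: float, latency_bound: bool) -> str:
    """Hypothetical rule of thumb for picking a parallelism strategy.

    model_gb:      size of the model weights (assumed to dominate memory use)
    gpu_gb:        usable memory on one GPU
    latency_bound: True if per-request latency is the main concern
    """
    if model_gb > gpu_gb:
        # Memory is the constraint: the weights must be split across devices.
        # Tensor parallelism puts every GPU to work on each layer;
        # pipeline parallelism splits by layer and communicates less.
        return "tensor" if latency_bound else "pipeline"
    # The model already fits on one GPU.
    # Replicating it scales throughput; splitting layers targets latency.
    return "tensor" if latency_bound else "data"
```

For example, under these assumptions `choose_layout(140, 80, latency_bound=False)` selects a pipeline split, while a model that fits on one device and only needs more throughput falls through to data parallelism.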
The communication cost
Splitting a model adds communication. Tensor parallelism exchanges data in every layer, so it favors high-bandwidth links within a node. Pipeline parallelism communicates only at stage boundaries, but it can leave GPUs idle in pipeline bubbles unless enough batches are in flight. Real systems combine these strategies and tune them to the interconnect.
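The effect of pipeline bubbles can be estimated with a little arithmetic. The sketch below assumes a simple fill-and-drain schedule in which every stage takes the same time per batch and transfer time is ignored; real schedulers interleave work differently, so treat the numbers as an upper-level intuition only. The function name `bubble_fraction` is hypothetical.

```python
def bubble_fraction(num_stages: int, batches_in_flight: int) -> float:
    """Fraction of time each GPU sits idle in a simple fill-and-drain pipeline.

    Assumes equal time per stage and no transfer cost. With p stages and
    m batches, the pipeline runs for m + p - 1 time slots, and each stage
    does useful work in only m of them.
    """
    if num_stages < 1 or batches_in_flight < 1:
        raise ValueError("need at least one stage and one batch")
    total_slots = batches_in_flight + num_stages - 1
    return (num_stages - 1) / total_slots


# Idle fraction for a 4-stage pipeline as more batches are kept in flight.
for m in (1, 4, 16, 64):
    print(f"{m:>3} batches in flight: {bubble_fraction(4, m):.1%} idle")
```

With 4 stages and a single batch in flight, each GPU is idle three quarters of the time; the idle share shrinks as more batches are kept in flight, which is why pipeline layouts depend on enough concurrent work.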
Key idea
Multi-GPU inference uses data, tensor, and pipeline parallelism. The choice depends on whether the limit is memory or latency, and it is traded off against the communication each pattern requires.