Multi-GPU Inference

When one GPU is not enough

Some models are too large to fit in a single GPU's memory, or too slow when served from one device. Multi-GPU inference spreads the work across devices, but the right strategy depends on whether memory or latency is the constraint.
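
To make the memory constraint concrete, here is a rough sketch that estimates how many GPUs are needed just to hold the weights. The function names, the 2 bytes per parameter (16-bit weights), and the usable_fraction headroom are assumptions chosen for illustration; the estimate ignores the KV cache and activations, which add to the real requirement.

```python
import math

GIB = 2**30  # bytes in one GiB


def weight_memory_gib(n_params, bytes_per_param=2):
    """Memory needed for the weights alone, in GiB (2 bytes = 16-bit weights)."""
    return n_params * bytes_per_param / GIB


def min_gpus_for_weights(n_params, gpu_mem_gib, bytes_per_param=2, usable_fraction=0.9):
    """Smallest number of GPUs whose combined usable memory holds the weights.

    usable_fraction is a hypothetical headroom factor; the KV cache and
    activations are not counted here.
    """
    needed = weight_memory_gib(n_params, bytes_per_param)
    usable_per_gpu = gpu_mem_gib * usable_fraction
    return math.ceil(needed / usable_per_gpu)


if __name__ == "__main__":
    for n_params in (7_000_000_000, 70_000_000_000):
        gib = weight_memory_gib(n_params)
        gpus = min_gpus_for_weights(n_params, gpu_mem_gib=80)
        print(f"{n_params:,} params: {gib:.1f} GiB of weights, at least {gpus} GPU(s)")
```

Under these assumptions, a 70-billion-parameter model needs about 130 GiB for its weights, so it cannot fit on one 80 GiB device, while a 7-billion-parameter model fits with room to spare.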

Forms of parallelism

Tensor parallelism splits each layer across GPUs, with every device holding a slice of the weights. It needs a fast interconnect because the GPUs exchange activations within every layer.
Pipeline parallelism places different layers on different GPUs and streams batches through them like an assembly line, so each device handles one stage.
Data parallelism replicates the whole model on each GPU and sends different requests to each replica, which scales throughput when the model already fits on one device.
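
The three forms can be compared on a toy two-layer network. The sketch below uses NumPy on the CPU, with Python lists standing in for devices, so it only illustrates how the work is divided; the function names and sizes are made up for the example, and a real system would use a collective such as all-reduce where the partial results are summed.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))     # batch of 4 requests, hidden size 8
w1 = rng.standard_normal((8, 16))   # layer 1 weights
w2 = rng.standard_normal((16, 8))   # layer 2 weights


def relu(a):
    return np.maximum(a, 0.0)


def reference(batch):
    """Whole model on one device."""
    return relu(batch @ w1) @ w2


def tensor_parallel(batch, n_dev=2):
    """Each device holds a column slice of w1 and the matching row slice of w2."""
    w1_shards = np.split(w1, n_dev, axis=1)
    w2_shards = np.split(w2, n_dev, axis=0)
    partials = [relu(batch @ a) @ b for a, b in zip(w1_shards, w2_shards)]
    # Summing the partial outputs stands in for the all-reduce between devices.
    return sum(partials)


def pipeline_parallel(batch, n_micro=2):
    """Stage 0 owns layer 1, stage 1 owns layer 2; micro-batches flow through both."""
    def stage0(a):
        return relu(a @ w1)

    def stage1(h):
        return h @ w2

    outputs = [stage1(stage0(mb)) for mb in np.split(batch, n_micro, axis=0)]
    return np.concatenate(outputs, axis=0)


def data_parallel(batch, n_replicas=2):
    """Every replica holds the full model and serves its own share of the requests."""
    parts = np.split(batch, n_replicas, axis=0)
    return np.concatenate([reference(p) for p in parts], axis=0)


if __name__ == "__main__":
    expected = reference(x)
    for fn in (tensor_parallel, pipeline_parallel, data_parallel):
        print(fn.__name__, np.allclose(fn(x), expected))
```

All three layouts produce the same output as the single-device model; they differ in what each device stores and in what has to cross the interconnect.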

Choosing a layout

The communication cost

Splitting a model adds communication. Tensor parallelism exchanges data in every layer, so it wants the high-bandwidth links found within a node. Pipeline parallelism communicates only at stage boundaries, but it can leave GPUs idle in pipeline bubbles unless enough batches are in flight. Real systems combine these strategies and tune them to the interconnect.
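
The effect of keeping more batches in flight can be estimated with a simple model. The sketch below assumes an idealized schedule in which every stage takes the same time per micro-batch and transfers are free; under that assumption a pipeline with p stages and m micro-batches finishes in m + p - 1 steps, and each device sits idle for p - 1 of them. The function name is hypothetical.

```python
from fractions import Fraction


def bubble_fraction(stages, micro_batches):
    """Idle share of each device's time in an idealized pipeline.

    Assumes equal time per stage and no transfer cost: the run takes
    (micro_batches + stages - 1) steps and each device is busy for
    micro_batches of them.
    """
    total_steps = micro_batches + stages - 1
    return Fraction(stages - 1, total_steps)


if __name__ == "__main__":
    for m in (1, 4, 32):
        frac = bubble_fraction(stages=4, micro_batches=m)
        print(f"4 stages, {m} micro-batches: idle fraction {float(frac):.2f}")
```

With a single batch, three of the four stages are idle at any moment; adding micro-batches shrinks the idle share, which is why pipeline layouts depend on having enough work in flight.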

Key idea

Multi-GPU inference uses data, tensor, and pipeline parallelism, chosen by whether the limit is memory or latency and traded off against the communication each pattern requires.
