Machine Learning

Multi-GPU Inference

Splitting big models across several GPUs to fit and serve them.

6 min read · advanced

When one GPU is not enough

Some models are too large to fit in a single GPU's memory, or too slow when served from one device. Multi-GPU inference spreads the work across devices, but the right strategy depends on whether memory or latency is the constraint.
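To make the memory constraint concrete, here is a rough back-of-envelope sketch in Python. It counts weight memory only. The function names, the default of 2 bytes per parameter (16-bit weights), and the 90% usable-memory fraction are illustrative assumptions; a real deployment also has to budget for the KV cache and activations.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and framework overhead.
    """
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(
    num_params: float,
    gpu_memory_gb: float,
    bytes_per_param: int = 2,
    usable_fraction: float = 0.9,  # assumed headroom, not a measured value
) -> int:
    """Smallest number of GPUs whose combined usable memory holds the weights."""
    needed = weight_memory_gb(num_params, bytes_per_param)
    return math.ceil(needed / (gpu_memory_gb * usable_fraction))


if __name__ == "__main__":
    # A hypothetical 70-billion-parameter model on hypothetical 80 GB GPUs.
    print(weight_memory_gb(70e9))
    print(min_gpus_for_weights(70e9, gpu_memory_gb=80))
```

If the answer is 1, the model already fits and the question becomes throughput or latency. If it is greater than 1, the weights have to be split across devices in some way.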

Forms of parallelism

  • Tensor parallelism splits each layer across GPUs, with every device holding a slice of the weights. It needs a fast interconnect because the GPUs must exchange activations within every layer.
  • Pipeline parallelism places different layers on different GPUs and streams batches through like an assembly line, so each device handles one stage.
  • Data parallelism replicates the whole model on each GPU and sends different requests to each, scaling throughput when the model already fits.
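As a toy illustration of the tensor-parallel case, the sketch below splits one layer's weight matrix by output columns across simulated devices, lets each compute its slice of the output, and then concatenates the slices. It uses plain Python lists rather than real GPUs, and the function names are hypothetical. The final concatenation stands in for the gather step that a real system would perform over the interconnect.

```python
def matvec(x, W):
    """Compute y = x @ W for a vector x (length d_in) and a matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def split_columns(W, num_devices):
    """Give each simulated device a contiguous slice of W's output columns."""
    d_out = len(W[0])
    assert d_out % num_devices == 0, "columns must divide evenly in this sketch"
    width = d_out // num_devices
    return [
        [row[k * width:(k + 1) * width] for row in W]
        for k in range(num_devices)
    ]


def tensor_parallel_matvec(x, W, num_devices):
    """Each device multiplies the full input by its own weight shard."""
    shards = split_columns(W, num_devices)
    partials = [matvec(x, shard) for shard in shards]  # one result per device
    # Gather step: stitch the per-device output slices back together.
    return [value for part in partials for value in part]


if __name__ == "__main__":
    x = [1, 2]
    W = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
    assert tensor_parallel_matvec(x, W, num_devices=2) == matvec(x, W)
```

Each simulated device stores only half of the weights here, which is the memory saving. The cost is that every layer ends with a gather across devices.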

Choosing a layout

The communication cost

Splitting a model adds communication. Tensor parallelism exchanges data in every layer, so it wants high-bandwidth links within a node. Pipeline parallelism communicates only at stage boundaries, but it can leave GPUs idle in pipeline bubbles unless enough batches are in flight. Real systems combine these strategies and tune them to the interconnect.
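The effect of keeping more batches in flight can be sketched with a small simulation. It assumes a simple fill-and-drain schedule in which every stage takes one time step per batch and stage s works on batch b at step s + b. Real schedulers and stage timings differ, and the function names are hypothetical.

```python
def simulate_pipeline(num_stages, num_batches):
    """Return a busy/idle timeline per stage for a fill-and-drain schedule.

    Assumes every stage takes exactly one time step per batch.
    """
    total_steps = num_batches + num_stages - 1
    busy = [[False] * total_steps for _ in range(num_stages)]
    for stage in range(num_stages):
        for batch in range(num_batches):
            # A batch reaches stage s one step after it left stage s - 1.
            busy[stage][stage + batch] = True
    return busy


def bubble_fraction(num_stages, num_batches):
    """Fraction of all (stage, time step) slots in which a GPU sits idle."""
    timeline = simulate_pipeline(num_stages, num_batches)
    total_slots = num_stages * len(timeline[0])
    idle_slots = sum(row.count(False) for row in timeline)
    return idle_slots / total_slots


if __name__ == "__main__":
    for batches in (1, 4, 32):
        print(batches, round(bubble_fraction(4, batches), 3))
```

Under these assumptions the idle fraction works out to (stages - 1) / (batches + stages - 1), so with four stages it falls from 75% with a single batch to under 10% with 32 batches in flight.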

Key idea

Multi-GPU inference uses data, tensor, and pipeline parallelism. The choice depends on whether the limit is memory or latency, and it is traded off against the communication each pattern requires.

Check yourself

1. What does tensor parallelism split across GPUs?

2. What is a downside of pipeline parallelism?

3. When is data parallelism the natural choice?