Model Parallelism: Tensor and Pipeline

Splitting a model across GPUs by partitioning tensors or by stacking stages.

6 min read · advanced

When one GPU is not enough

Some models are too large to fit in the memory of a single GPU. Model parallelism splits the model itself across devices. Two main styles exist, tensor parallelism and pipeline parallelism, and large systems often combine them.

Tensor parallelism

Tensor parallelism splits individual layers. A large matrix multiply is divided so each GPU computes part of it, then the partial results are combined with a communication step:

  • Each device holds a slice of each weight matrix.
  • Every device works on the same tokens at the same time.
  • A reduce operation merges the partial results in every layer.

It needs fast interconnects because the devices must exchange data in every layer, so it usually stays within one machine.
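The split-and-merge idea can be sketched without any GPUs. The snippet below is a minimal simulation in plain Python, assuming two pretend devices, a row-wise split of the weight matrix, and an element-wise sum standing in for the reduce step. The helper names (`matmul`, `split_columns`, `split_rows`, `all_reduce_sum`, `tensor_parallel_matmul`) are hypothetical and chosen only for illustration; real systems use a communication library for the merge.

```python
# Simulate tensor parallelism for one layer Y = X @ W on pretend devices,
# using plain Python lists instead of real GPU tensors.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m]
            for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce_sum(partials):
    """Stand-in for the reduce step: element-wise sum of per-device results."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tensor_parallel_matmul(x, w, num_devices=2):
    # Each device holds one row block of W (its slice of the weights) and
    # receives the matching column block of X for the same tokens.
    x_slices = split_columns(x, num_devices)
    w_slices = split_rows(w, num_devices)
    # Every device computes a partial product of the full output shape.
    partials = [matmul(xs, ws) for xs, ws in zip(x_slices, w_slices)]
    # The partial products are summed to recover the full result.
    return all_reduce_sum(partials)

x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]   # 2 tokens, hidden size 4
w = [[1, 0],
     [0, 1],
     [2, 0],
     [0, 2]]         # 4 x 2 weight matrix

# The sharded computation matches the single-device matrix multiply.
assert tensor_parallel_matmul(x, w) == matmul(x, w)
```

Each pretend device produces a full-size partial output, which is why the merge has to move data between devices in every layer.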

Pipeline parallelism

Pipeline parallelism splits the model by depth into stages, with one group of consecutive layers per GPU. Tokens flow through stage one, then stage two, and so on. To keep GPUs from idling while they wait for the previous stage, the batch is cut into micro-batches that enter the pipeline one after another, so different stages work on different micro-batches at the same time. A startup and drain period called the bubble still wastes some time: later stages wait for the first micro-batch to arrive, and earlier stages go idle once the last one has passed.
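The effect of micro-batches on the bubble can be sketched with a small schedule simulation. This is a simplified model, assuming a forward pass only and that every stage takes one equal time step per micro-batch. The function names `pipeline_schedule` and `bubble_fraction` are hypothetical, not part of any library.

```python
def pipeline_schedule(num_stages, num_micro_batches):
    """Return a grid where schedule[t][s] is the micro-batch index that
    stage s processes at time step t, or None if the stage is idle."""
    total_steps = num_micro_batches + num_stages - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            # Micro-batch mb reaches stage s exactly s steps after it enters.
            mb = t - s
            row.append(mb if 0 <= mb < num_micro_batches else None)
        schedule.append(row)
    return schedule

def bubble_fraction(schedule):
    """Fraction of (time step, stage) slots in which a stage sits idle."""
    slots = [slot for row in schedule for slot in row]
    return slots.count(None) / len(slots)

# Four stages, increasing number of micro-batches.
for m in (1, 4, 16):
    print(m, round(bubble_fraction(pipeline_schedule(4, m)), 3))
# 1 0.75
# 4 0.429
# 16 0.158
```

Under these assumptions the idle share works out to (stages - 1) / (micro-batches + stages - 1), so more micro-batches shrink the bubble but never remove it.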

Key idea

Tensor parallelism splits each layer across GPUs and needs fast links between them, while pipeline parallelism splits the model by depth into stages and feeds them micro-batches to keep them busy.

Check yourself

1. What does tensor parallelism split?

2. Why does pipeline parallelism use micro-batches?

3. Why does tensor parallelism need fast interconnects?