Model Parallelism: Tensor and Pipeline

When one GPU is not enough

Some models are too large to fit in the memory of a single GPU. Model parallelism splits the model itself across devices. There are two main styles, tensor parallelism and pipeline parallelism, and large systems often combine them.

Tensor parallelism

Tensor parallelism splits individual layers. A large matrix multiply is divided so each GPU computes part of it, then the partial results are combined with a communication step:

Each device holds a slice of each weight matrix.
Every device works on the same tokens at the same time.
A reduce operation (typically an all-reduce) merges the partial results in every layer.

Tensor parallelism needs fast interconnects because the devices communicate in every layer, so it usually stays within one machine.
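The split-and-reduce pattern described above can be sketched in plain Python. This is a minimal illustration, not how any particular framework implements it: the function names `matmul` and `tensor_parallel_matmul` are hypothetical, the "devices" are simulated by a loop, and the weight matrix is assumed to be split along its input dimension so that the partial results can be merged by summing.

```python
def matmul(x, w):
    """Plain matrix multiply on nested lists: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def tensor_parallel_matmul(x, w, num_devices):
    """Simulate a matmul whose weight rows are sharded across devices.

    Assumption for this sketch: W is split along its input (row) dimension,
    so each device also needs only the matching columns of x.
    """
    k = len(w)
    assert k % num_devices == 0, "input dimension must divide evenly"
    shard = k // num_devices

    partials = []
    for d in range(num_devices):          # each iteration stands in for one GPU
        lo, hi = d * shard, (d + 1) * shard
        w_shard = w[lo:hi]                    # this device's slice of the weights
        x_shard = [row[lo:hi] for row in x]   # same tokens, matching features
        partials.append(matmul(x_shard, w_shard))

    # The communication step: sum the partial results element-wise,
    # which is what a sum-reduce across devices would produce.
    n, m = len(x), len(w[0])
    return [[sum(p[i][j] for p in partials) for j in range(m)]
            for i in range(n)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]        # 2 tokens, 4 features
w = [[1, 0], [0, 1], [2, 1], [1, 2]]    # 4 x 2 weight matrix
assert tensor_parallel_matmul(x, w, num_devices=2) == matmul(x, w)
```

Every simulated device sees both tokens but only part of the weights, and the final sum is the step that would cross the interconnect on real hardware.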

Pipeline parallelism

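Pipeline parallelism splits the model by depth into stages, with one group of layers per GPU. Activations flow through stage one, then stage two, and so on. If the whole batch moved through as one unit, each GPU would sit idle while waiting for the previous stage. To avoid this, the batch is cut into micro-batches that enter the pipeline one after another, so that different stages work on different micro-batches at the same time. A startup and drain period called the bubble still wastes some time, because the later stages have nothing to do at the start and the earlier stages have nothing to do at the end.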
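To see how micro-batches shrink the bubble, the schedule can be laid out as a grid of time steps and stages. The sketch below is a simplified model under stated assumptions: it covers the forward pass only, every stage takes exactly one time step per micro-batch, and the names `pipeline_schedule` and `bubble_fraction` are hypothetical helpers introduced for illustration.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return grid[t][s]: the micro-batch that stage s processes at step t.

    A cell is None when the stage is idle. Assumes forward pass only and
    one time step per stage per micro-batch.
    """
    total_steps = num_microbatches + num_stages - 1
    grid = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # micro-batch mb reaches stage s at step mb + s
            row.append(mb if 0 <= mb < num_microbatches else None)
        grid.append(row)
    return grid


def bubble_fraction(grid):
    """Fraction of (step, stage) slots in which a stage sits idle."""
    slots = [cell for row in grid for cell in row]
    return slots.count(None) / len(slots)


for m in (1, 4, 8, 32):
    grid = pipeline_schedule(num_stages=4, num_microbatches=m)
    print(f"{m:>2} micro-batches: idle fraction {bubble_fraction(grid):.2f}")
```

In this simplified model the idle fraction works out to (stages - 1) / (micro-batches + stages - 1), so with 4 stages a single batch leaves the GPUs idle 75% of the time, while more micro-batches push that figure down without ever reaching zero.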

Key idea

Tensor parallelism splits each layer across GPUs and needs fast links between them, while pipeline parallelism splits the model by depth into stages that are fed micro-batches to stay busy.
