When one GPU is not enough
Some models are too large to fit in the memory of a single GPU. Model parallelism splits the model itself across devices. The two main styles are tensor parallelism and pipeline parallelism, and large systems often combine them.
Tensor parallelism
Tensor parallelism splits individual layers. A large matrix multiply is divided so each GPU computes part of it, then the partial results are combined with a communication step:
- Each device holds a slice of each weight matrix.
- Every device works on the same tokens at the same time.
- A collective operation such as an all-reduce merges the partial results at every layer.
Tensor parallelism needs fast interconnects because the devices communicate at every layer, so it usually stays within a single machine.
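As a rough illustration of the split-then-combine idea, the sketch below simulates tensor parallelism for one matrix multiply in plain Python. It assumes one particular split (the weight matrix is cut along its inner dimension, so each simulated device produces a full-shape partial result that is summed); real systems also use splits whose results are concatenated instead. The names matmul and tensor_parallel_matmul are hypothetical helpers invented for this example, and the "devices" are just loop iterations.

```python
def matmul(a, b):
    """Plain matrix multiply on nested lists: (n x k) times (k x m) gives (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def tensor_parallel_matmul(x, w, num_devices):
    """Simulate x @ w with w split by rows across `num_devices` devices."""
    k = len(w)
    assert k % num_devices == 0, "inner dimension must divide evenly"
    shard = k // num_devices

    partials = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        w_slice = w[lo:hi]                    # this device's slice of the weights
        x_slice = [row[lo:hi] for row in x]   # matching feature columns, same tokens
        partials.append(matmul(x_slice, w_slice))  # full-shape partial result

    # Stand-in for the all-reduce: element-wise sum over every device's partial.
    out = partials[0]
    for p in partials[1:]:
        out = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out, p)]
    return out


x = [[1, 2, 3, 4], [5, 6, 7, 8]]        # 2 tokens, hidden size 4
w = [[1, 0], [0, 1], [2, 0], [0, 2]]    # 4 x 2 weight matrix

# The sharded computation matches the single-device result.
assert tensor_parallel_matmul(x, w, 2) == matmul(x, w)
```

The summing loop is the step that would travel over the interconnect on real hardware, and it runs once per layer, which is why link speed matters so much for this style.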
Pipeline parallelism
Pipeline parallelism splits the model by depth into stages, with one group of consecutive layers per GPU. Tokens flow through stage one, then stage two, and so on. To keep GPUs from idling while they wait for the previous stage, the batch is cut into micro-batches that enter the pipeline one after another, so different stages work on different micro-batches at the same time. A startup and drain period, called the bubble, still wastes some time: later stages wait for the first micro-batch to arrive, and earlier stages finish before the last one leaves.
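To make the bubble concrete, here is a small sketch that lays out a simplified pipeline schedule and counts the idle slots. It assumes a forward pass only, that every stage takes one equal time step per micro-batch, and that a micro-batch moves to the next stage on the following step. The function names pipeline_schedule and bubble_fraction are made up for this illustration and do not come from any library.

```python
def pipeline_schedule(num_stages, num_micro_batches):
    """Return a grid where grid[t][s] is the micro-batch that stage s
    processes at time step t, or None if the stage is idle."""
    total_steps = num_stages + num_micro_batches - 1
    grid = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # micro-batch mb reaches stage s at step mb + s
            row.append(mb if 0 <= mb < num_micro_batches else None)
        grid.append(row)
    return grid


def bubble_fraction(num_stages, num_micro_batches):
    """Fraction of (time step, stage) slots in which a stage sits idle."""
    grid = pipeline_schedule(num_stages, num_micro_batches)
    slots = len(grid) * num_stages
    idle = sum(cell is None for row in grid for cell in row)
    return idle / slots


# One big batch on 4 stages: only one stage is ever busy at a time.
print(bubble_fraction(4, 1))
# Cutting the batch into 8 micro-batches shrinks the idle share.
print(bubble_fraction(4, 8))
```

Under these assumptions the idle share works out to (stages - 1) / (micro-batches + stages - 1), so with 4 stages it drops from 12 of 16 slots for a single batch to 12 of 44 slots for 8 micro-batches. More micro-batches shrink the bubble but never remove it.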
Key idea
Tensor parallelism splits each layer across GPUs and needs fast links, while pipeline parallelism splits the model by depth into stages and feeds them micro-batches to keep them busy.