Pipeline Parallelism

Filling the bubble

Plain model parallelism runs the stages one after another, so each device sits idle while it waits its turn. Pipeline parallelism fixes this by splitting the batch into smaller micro batches and streaming them through the stages like an assembly line.

The model is cut into sequential stages, one per device.
Micro batches enter one after another so stages overlap work.
While stage two processes micro batch one, stage one starts micro batch two.
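
The overlap described above can be sketched as a small timeline. This is a minimal illustration, not a real training loop: it covers the forward pass only and assumes every stage takes exactly one time tick per micro batch and that communication is free. The names `forward_schedule` and `render` are hypothetical helpers introduced for this example.

```python
def forward_schedule(num_stages, num_micro_batches):
    """Return grid[t][s]: the 0-based micro batch that stage s handles
    at tick t, or None if the stage is idle at that tick.

    Assumes one tick per stage per micro batch (an idealization).
    """
    num_ticks = num_stages + num_micro_batches - 1
    grid = []
    for t in range(num_ticks):
        row = []
        for s in range(num_stages):
            mb = t - s  # micro batch mb reaches stage s at tick mb + s
            row.append(mb if 0 <= mb < num_micro_batches else None)
        grid.append(row)
    return grid


def render(grid):
    """Format the grid as text, one line per tick, one column per stage."""
    lines = []
    for t, row in enumerate(grid):
        cells = ["idle" if mb is None else f"mb{mb + 1}" for mb in row]
        lines.append(f"t={t}: " + " | ".join(f"{c:>4}" for c in cells))
    return "\n".join(lines)


if __name__ == "__main__":
    # 3 stages, 4 micro batches: at t=1 stage one runs mb2 while stage two runs mb1.
    print(render(forward_schedule(num_stages=3, num_micro_batches=4)))
```

The idle cells at the top right and bottom left of the printed grid are the stages that have nothing to do while the pipeline fills and drains.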

The pipeline bubble

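At the start and end of each step some stages have nothing to do: later stages wait for the first micro batch to arrive, and earlier stages finish before the last one drains. This idle region is the bubble. More micro batches per step shrink the bubble as a fraction of the step, pushing utilization toward 100%.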

Bubble overhead falls as micro batch count rises.
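Too many micro batches make each one small, which can make per-micro-batch statistics (for example in batch normalization) noisy.
Schedules like one forward one backward (1F1B) reduce peak memory by starting backward passes before all forward passes have finished.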
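
The trade-offs in this list can be put into numbers with a short sketch. It assumes an idealized pipeline: every stage takes the same time per micro batch, communication is free, and the schedule is either "all forwards, then all backwards" (labeled `"fill_drain"` here) or non-interleaved 1F1B. Under those assumptions the idle fraction of a step with p stages and m micro batches is (p - 1) / (m + p - 1). The function names and the schedule labels are hypothetical, chosen for this example only.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Idle fraction of one step in an idealized pipeline.

    Each stage is busy for m ticks out of m + p - 1, so the idle
    share is (p - 1) / (m + p - 1).
    """
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("need at least one stage and one micro batch")
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


def peak_in_flight(stage, num_stages, num_micro_batches, schedule):
    """Peak number of micro batches whose activations a stage must hold.

    stage is 0-based. "fill_drain" runs every forward before any backward,
    so all m micro batches are held at once. Non-interleaved "1f1b" starts
    backward passes early, so stage s holds at most p - s at a time.
    """
    if schedule == "fill_drain":
        return num_micro_batches
    if schedule == "1f1b":
        return min(num_micro_batches, num_stages - stage)
    raise ValueError(f"unknown schedule: {schedule}")


if __name__ == "__main__":
    stages = 4
    for m in (1, 4, 16, 64):
        print(f"m={m:>2}  bubble={bubble_fraction(stages, m):.3f}  "
              f"fill_drain peak={peak_in_flight(0, stages, m, 'fill_drain')}  "
              f"1f1b peak={peak_in_flight(0, stages, m, '1f1b')}")
```

In this model, raising the micro batch count keeps lowering the bubble, while the 1F1B peak stops growing once the micro batch count exceeds the number of stages.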

Stage flow

By keeping every stage fed with micro batches, the pipeline approaches the throughput of a system with no idle stages.

Key idea

Pipeline parallelism stages the model across devices and streams micro batches through them, shrinking the idle bubble so model parallel hardware stays busy.
