The Pipeline Parallelism Deep Dive

Slicing the layer stack across devices and streaming microbatches to fill bubbles.

Slicing by depth

Pipeline parallelism gives each device a contiguous block of layers, called a stage. A batch flows forward through the stages and gradients flow back, like an assembly line for the network.
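
As a minimal sketch of this split, assume the layers are simply numbered 0 to n-1 and each stage should receive a near-equal count of them. The helper name split_into_stages is hypothetical and not taken from any library.

```python
def split_into_stages(num_layers: int, num_stages: int) -> list[range]:
    """Assign each stage a contiguous block of layer indices."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Illustrative sizes only: 10 layers spread over 4 devices.
for stage_id, layers in enumerate(split_into_stages(10, 4)):
    print(f"stage {stage_id}: layers {list(layers)}")
```

Splitting by layer count is the simplest choice; it only balances the work if every layer costs about the same.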

The bubble problem

If a whole batch traverses one stage at a time, only one device is working at any moment while the rest sit idle. This wasted time is the pipeline bubble.

The bubble fraction grows with the number of stages.
It shrinks as you split each batch into more microbatches.
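
To make both trends concrete, here is a small sketch based on a commonly cited idealized estimate: with p stages and m microbatches, a simple fill-and-drain schedule idles for a fraction (p - 1) / (m + p - 1) of the step. The estimate assumes every stage takes the same time per microbatch and ignores communication cost; the function name is illustrative.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idealized idle fraction of a fill-and-drain pipeline step."""
    p, m = num_stages, num_microbatches
    if p < 1 or m < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (p - 1) / (m + p - 1)


# Illustrative sweep: more stages raise the bubble, more microbatches lower it.
for p in (2, 4, 8):
    row = ", ".join(f"m={m}: {bubble_fraction(p, m):.2f}" for m in (1, 4, 32))
    print(f"p={p} -> {row}")
```

Under these assumptions a 4-stage pipeline with a single microbatch idles 75% of the time, which is the whole-batch case described above.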

Scheduling to fill it

Splitting a batch into many microbatches keeps several stages busy simultaneously.
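The one-forward-one-backward (1F1B) schedule interleaves forward and backward passes so each stage can free a microbatch's activations early; compared with running all forward passes before any backward pass, this chiefly cuts memory.
Interleaved schedules assign each device several non-contiguous layer chunks to shrink the bubble further.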
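
The sketch below generates the per-stage operation order for one common formulation of a non-interleaved 1F1B schedule: a warm-up of forward passes, a steady state that alternates one forward with one backward, and a cool-down of the remaining backward passes. The warm-up length of num_stages - 1 - stage is an assumption drawn from that common formulation, and all names are hypothetical. It lists the order of operations only, not their timing.

```python
def one_f_one_b(stage: int, num_stages: int, num_microbatches: int) -> list[tuple[str, int]]:
    """Return the ordered ("F" or "B", microbatch_id) operations for one stage."""
    warmup = min(num_stages - 1 - stage, num_microbatches)
    ops = [("F", i) for i in range(warmup)]  # warm-up: forwards only
    fwd, bwd = warmup, 0
    while fwd < num_microbatches:  # steady state: one forward, one backward
        ops.append(("F", fwd))
        fwd += 1
        ops.append(("B", bwd))
        bwd += 1
    while bwd < num_microbatches:  # cool-down: drain remaining backwards
        ops.append(("B", bwd))
        bwd += 1
    return ops


def peak_in_flight(ops: list[tuple[str, int]]) -> int:
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


for s in range(4):
    ops = one_f_one_b(s, num_stages=4, num_microbatches=8)
    print(s, peak_in_flight(ops), "".join(f"{k}{i} " for k, i in ops))
```

In this formulation the first stage holds the most microbatches in flight and the last stage holds only one, regardless of how many microbatches the batch is split into.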

Trade-offs

Communication is light: only activations, and their gradients on the backward pass, cross stage boundaries.
But peak memory grows with the number of in-flight microbatches.
Balancing stage compute matters; an uneven split stalls the whole line.
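
To illustrate the balancing point, the sketch below compares two ways of cutting the same layer stack. It assumes that in steady state the line advances at the pace of its slowest stage, so the ratio of mean to maximum stage cost is a rough efficiency measure. The per-layer costs are made-up relative numbers and the helper names are hypothetical.

```python
def stage_costs(layer_costs: list[float], boundaries: list[int]) -> list[float]:
    """Sum layer costs per stage; `boundaries` holds the first layer index of stages 1..p-1."""
    edges = [0, *boundaries, len(layer_costs)]
    return [sum(layer_costs[a:b]) for a, b in zip(edges, edges[1:])]


def balance_efficiency(costs: list[float]) -> float:
    """Mean stage cost divided by the slowest stage's cost; 1.0 is perfectly balanced."""
    return sum(costs) / (len(costs) * max(costs))


layer_costs = [1, 1, 1, 1, 4, 4]  # hypothetical relative compute per layer

by_count = stage_costs(layer_costs, [2, 4])  # two layers per stage
by_cost = stage_costs(layer_costs, [4, 5])   # cut where the cost evens out

print(by_count, balance_efficiency(by_count))
print(by_cost, balance_efficiency(by_cost))
```

With these numbers, an equal layer count leaves one stage doing four times the work of the others, while the cost-aware cut gives every stage the same load.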

Key idea