Slicing by depth
Pipeline parallelism gives each device a contiguous block of layers, called a stage. A batch flows forward through the stages and gradients flow back, like an assembly line for the network.
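As a rough illustration of "a contiguous block of layers", the sketch below assigns layer indices to stages as evenly as possible by layer count. The function name `split_into_stages` is hypothetical, and splitting by count alone assumes every layer costs about the same, which real models often violate.

```python
def split_into_stages(num_layers: int, num_stages: int) -> list[range]:
    """Give each stage a contiguous range of layer indices (illustrative helper)."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# 10 layers over 4 devices: stage sizes 3, 3, 2, 2.
for device, layers in enumerate(split_into_stages(10, 4)):
    print(f"stage {device}: layers {list(layers)}")
```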
The bubble problem
If a whole batch traverses one stage at a time, only one device works at any moment while the rest sit idle. This wasted time is the pipeline bubble.
- The bubble fraction grows with the number of stages.
- It shrinks as each batch is split into more microbatches.
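Both bullets can be made concrete with a small formula. Under simplifying assumptions that are not stated in these notes (every stage takes the same time per microbatch, communication is free, and the schedule is non-interleaved with a flush at the end of each batch), a batch of m microbatches on p stages takes m + p - 1 time slots per device, of which p - 1 are idle. The function name below is hypothetical.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of each device's time, assuming equal stage times and free communication."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)


# More stages -> larger bubble; more microbatches -> smaller bubble.
for p in (2, 4, 8):
    row = ", ".join(f"m={m}: {bubble_fraction(p, m):.2f}" for m in (1, 4, 16, 64))
    print(f"p={p}: {row}")
```

With one microbatch and four stages the fraction is 3/4, which matches the picture of only one device in four being busy at a time.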
Scheduling to fill it
- Splitting a batch into many microbatches keeps several stages busy simultaneously.
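- The one-forward-one-backward (1F1B) schedule interleaves forward and backward passes, so each stage holds activations for at most about as many microbatches as there are stages. This cuts peak memory; on its own it leaves the bubble the same size as running all forwards and then all backwards.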
- Interleaved schedules assign each device several non-contiguous layer chunks to shrink the bubble further.
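To see what a one-forward-one-backward schedule looks like on a single stage, here is a minimal sketch of the per-stage operation order: a warm-up of forwards, a steady state alternating one forward with one backward, and a cool-down of backwards. The function names are hypothetical, and the sketch only orders operations; it does not model timing, communication, or interleaved layer chunks.

```python
def one_f_one_b(stage: int, num_stages: int, num_microbatches: int) -> list[tuple[str, int]]:
    """Op order for one stage: ('F', i) is the forward of microbatch i, ('B', i) its backward."""
    # Earlier stages run more forwards up front so later stages have work to do.
    warmup = min(num_stages - 1 - stage, num_microbatches)
    ops = [("F", i) for i in range(warmup)]
    next_f, next_b = warmup, 0
    while next_f < num_microbatches:      # steady state: one forward, then one backward
        ops.append(("F", next_f))
        next_f += 1
        ops.append(("B", next_b))
        next_b += 1
    while next_b < num_microbatches:      # cool-down: drain the remaining backwards
        ops.append(("B", next_b))
        next_b += 1
    return ops


def peak_in_flight(ops: list[tuple[str, int]]) -> int:
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


p, m = 4, 8
for s in range(p):
    ops = one_f_one_b(s, p, m)
    print(f"stage {s}: peak in flight = {peak_in_flight(ops)}  ", " ".join(f"{k}{i}" for k, i in ops))
```

In this sketch the first stage never holds more than p microbatches, whereas running all m forwards before any backward would hold m of them.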
Trade-offs
- Communication is small: only activations (and, on the backward pass, their gradients) cross stage boundaries.
- But peak activation memory grows with the number of in-flight microbatches.
- Balancing stage compute matters; an uneven split stalls the whole line.
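The cost of an uneven split can be estimated with a simple model: in steady state the slowest stage sets the pace, so every other device waits for it on each microbatch. The sketch below scores a split under that assumption and, for small models, searches for the contiguous split with the smallest maximum stage cost. The per-layer costs and both function names are hypothetical; real costs would have to be measured.

```python
from functools import lru_cache


def balance_efficiency(stage_costs: list[float]) -> float:
    """Useful share of device time in steady state when the slowest stage sets the pace."""
    return sum(stage_costs) / (len(stage_costs) * max(stage_costs))


def best_split(layer_costs: list[float], num_stages: int) -> tuple[float, tuple[int, ...]]:
    """Contiguous split minimising the largest stage cost.

    Returns (largest stage cost, end index of each stage). Requires
    num_stages <= len(layer_costs).
    """
    n = len(layer_costs)
    prefix = [0.0]
    for c in layer_costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def solve(start: int, k: int) -> tuple[float, tuple[int, ...]]:
        if k == 1:                               # last stage takes all remaining layers
            return prefix[n] - prefix[start], (n,)
        best = (float("inf"), ())
        for end in range(start + 1, n - k + 2):  # leave at least one layer per later stage
            head = prefix[end] - prefix[start]
            tail, cuts = solve(end, k - 1)
            worst = max(head, tail)
            if worst < best[0]:
                best = (worst, (end,) + cuts)
        return best

    return solve(0, num_stages)


costs = [1, 1, 1, 1, 4]                   # hypothetical: the last layer is expensive
print(balance_efficiency([3, 5]))          # split by layer count: layers [0:3] and [3:5]
print(best_split(costs, 2))                # split by cost: layers [0:4] and [4:5]
```

Splitting these five layers by count gives stage costs 3 and 5, while splitting by cost gives 4 and 4, which is why stage boundaries are usually chosen from measured compute rather than layer counts.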
Key idea
Pipeline parallelism splits layers into stages and streams many microbatches through them to shrink the idle bubble, using a one-forward-one-backward schedule to bound memory and paying only boundary communication.