A memory problem tensor parallelism misses
Tensor parallelism shards the matrix multiplies in the attention and MLP blocks, but the layer norm and dropout regions between them still keep a full copy of the activations on every device. With long sequences, those replicated activations come to dominate memory.
The idea
Sequence parallelism splits the activations along the sequence length dimension in exactly those replicated regions.
- Each device holds activations for only its slice of tokens there.
- It pairs with tensor parallelism, switching between the two layouts inside each block.
- The switch uses all-gather and reduce-scatter collectives, each moving roughly half the data of the all-reduce they replace.
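The layout switch can be sketched as a single-process NumPy simulation. This is a minimal illustration under stated assumptions, not a distributed implementation: the "devices" are list entries, and the helper names (`all_gather`, `reduce_scatter`, `mlp_block_sequence_parallel`) and the ReLU MLP are hypothetical choices made for the example.

```python
import numpy as np

def all_gather(shards):
    """Simulated all-gather: every device receives the full sequence."""
    full = np.concatenate(shards, axis=0)
    return [full.copy() for _ in shards]

def reduce_scatter(partials):
    """Simulated reduce-scatter: sum partial outputs, give each device one token slice."""
    total = np.sum(partials, axis=0)
    return np.array_split(total, len(partials), axis=0)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp_block_sequence_parallel(x_shards, w1_cols, w2_rows):
    # Sequence-parallel region: each device normalizes only its own tokens.
    normed = [layer_norm(x) for x in x_shards]
    # Switch to the tensor-parallel layout: gather the full sequence.
    gathered = all_gather(normed)
    # Tensor-parallel region: column-parallel matmul, ReLU, row-parallel matmul.
    partials = [np.maximum(g @ w1, 0.0) @ w2
                for g, w1, w2 in zip(gathered, w1_cols, w2_rows)]
    # Switch back: sum the partial outputs and keep only the local token slice.
    return reduce_scatter(partials)

rng = np.random.default_rng(0)
p, seq, hidden, ffn = 4, 8, 16, 32  # assumed toy sizes
x = rng.standard_normal((seq, hidden))
w1 = rng.standard_normal((hidden, ffn))
w2 = rng.standard_normal((ffn, hidden))

x_shards = np.array_split(x, p, axis=0)   # activations split along sequence
w1_cols = np.array_split(w1, p, axis=1)   # first weight split by columns
w2_rows = np.array_split(w2, p, axis=0)   # second weight split by rows

out_shards = mlp_block_sequence_parallel(x_shards, w1_cols, w2_rows)
reference = np.maximum(layer_norm(x) @ w1, 0.0) @ w2
```

Concatenating `out_shards` along the sequence axis should reproduce `reference`, the unsharded computation, while each simulated device only ever holds `seq / p` tokens outside the matmul region.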
Why it pays off
- Activation memory in the norm and dropout regions drops by a factor equal to the tensor-parallel degree.
- It enables longer context or larger batches without extra devices.
- Total communication volume stays essentially the same as plain tensor parallelism, since a ring all-reduce is itself a reduce-scatter followed by an all-gather.
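The memory and communication claims can be checked with a little arithmetic. The sketch below assumes ring-based collectives, where each device sends (p - 1) / p of the tensor for an all-gather or a reduce-scatter and twice that for an all-reduce. The function names and the example sizes are illustrative assumptions, not measurements.

```python
def ring_all_reduce_elems(n, p):
    """Elements each device sends in a ring all-reduce of n elements."""
    return 2 * (p - 1) * n // p

def ring_all_gather_elems(n, p):
    """Elements each device sends in a ring all-gather producing n elements."""
    return (p - 1) * n // p

def ring_reduce_scatter_elems(n, p):
    """Elements each device sends in a ring reduce-scatter of n elements."""
    return (p - 1) * n // p

def norm_dropout_activation_elems(seq, batch, hidden, p, sequence_parallel):
    """Activation elements one device holds in a layer norm or dropout region."""
    full = seq * batch * hidden
    return full // p if sequence_parallel else full

# Assumed example sizes: 8192 tokens, batch 1, hidden size 4096, 8 devices.
seq, batch, hidden, p = 8192, 1, 4096, 8
n = seq * batch * hidden

tp_only_comm = ring_all_reduce_elems(n, p)
sp_comm = ring_all_gather_elems(n, p) + ring_reduce_scatter_elems(n, p)

tp_only_mem = norm_dropout_activation_elems(seq, batch, hidden, p, False)
sp_mem = norm_dropout_activation_elems(seq, batch, hidden, p, True)
```

Under these assumptions `sp_comm` equals `tp_only_comm`, while `sp_mem` is `tp_only_mem` divided by `p`.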
Relation to context parallelism
For very long contexts, context parallelism extends the idea to the attention computation itself: queries, keys, and values are sharded along the sequence, so no single device holds the full attention matrix.
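One way to see why the full attention matrix is never needed is blockwise attention with a running (online) softmax. The following is a single-process NumPy sketch under stated assumptions: attention is non-causal, the function names are hypothetical, and the key/value blocks are simply iterated in a loop, whereas a real system would pass them between devices.

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: materializes the whole (seq x seq) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def blockwise_attention(q_local, kv_blocks):
    """Attention for one device's queries, visiting one key/value block at a time."""
    rows, d = q_local.shape
    running_max = np.full((rows, 1), -np.inf)
    denom = np.zeros((rows, 1))
    acc = np.zeros((rows, kv_blocks[0][1].shape[-1]))
    for k_blk, v_blk in kv_blocks:
        # Only a (local queries x one block) slice of scores exists at a time.
        scores = q_local @ k_blk.T / np.sqrt(d)
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        # Rescale earlier partial sums to the new running maximum.
        scale = np.exp(running_max - new_max)
        weights = np.exp(scores - new_max)
        denom = denom * scale + weights.sum(axis=-1, keepdims=True)
        acc = acc * scale + weights @ v_blk
        running_max = new_max
    return acc / denom

rng = np.random.default_rng(0)
p, seq, d = 4, 16, 8  # assumed toy sizes
q, k, v = (rng.standard_normal((seq, d)) for _ in range(3))

q_shards = np.array_split(q, p, axis=0)
kv_blocks = list(zip(np.array_split(k, p, axis=0), np.array_split(v, p, axis=0)))

# Each simulated device handles its own query shard against every key/value block.
out = np.concatenate([blockwise_attention(qs, kv_blocks) for qs in q_shards], axis=0)
expected = full_attention(q, k, v)
```

Each simulated device only ever forms a score slice of shape `(seq / p, seq / p)`, yet `out` should agree with `expected` up to floating-point error.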
Key idea
Sequence parallelism shards activations along the sequence length in the layer norm and dropout regions, cutting activation memory there by the tensor-parallel degree and pairing with tensor parallelism via all-gather and reduce-scatter.