When one GPU is not enough
Large models hold billions of weights that may not fit in a single GPU's memory. Sharding splits the model across multiple GPUs, so each one holds and computes a different piece and together they run the whole model.
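A rough way to see when sharding becomes necessary is to compare the size of the weights with the memory of one device. The sketch below is illustrative only: the function name, the 70-billion-parameter model, the 2 bytes per weight, and the 80 GB device are all assumed numbers, not figures from this text. It counts weights alone, so the result is a lower bound; a real deployment also needs room for activations and other runtime state.

```python
import math

def min_gpus_for_weights(num_params, bytes_per_param, gpu_mem_bytes):
    """Smallest GPU count whose combined memory can hold the weights alone."""
    weight_bytes = num_params * bytes_per_param
    return math.ceil(weight_bytes / gpu_mem_bytes)

# Hypothetical example: 70 billion weights at 2 bytes each on 80 GB devices.
NUM_PARAMS = 70_000_000_000
BYTES_PER_PARAM = 2
GPU_MEM_BYTES = 80_000_000_000

weight_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9   # 140 GB of weights
gpus_needed = min_gpus_for_weights(NUM_PARAMS, BYTES_PER_PARAM, GPU_MEM_BYTES)
print(f"{weight_gb:.0f} GB of weights need at least {gpus_needed} GPUs")
```

Under these assumptions the weights take 140 GB, which cannot fit on one 80 GB device, so at least two shards are required.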
Two ways to split
- Tensor parallelism splits individual layers, so each GPU computes part of every layer and they combine results.
- Pipeline parallelism assigns whole layers to different GPUs, passing activations from one GPU to the next like stations on an assembly line.
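The two splits can be sketched in plain Python, with loops standing in for GPUs. This is a toy under stated assumptions: layers are bare weight matrices with no nonlinearity, the layer and row counts divide evenly by the number of GPUs, and the helper names (matvec, tensor_parallel_layer, pipeline_parallel) are hypothetical. It shows only that both splits reproduce the unsharded result.

```python
def matvec(weights, x):
    """Multiply a weight matrix (a list of rows) by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def tensor_parallel_layer(weights, x, num_gpus):
    """Split one layer's rows across GPUs, then concatenate the partial outputs."""
    rows_per_gpu = len(weights) // num_gpus          # assumes an even split
    parts = []
    for gpu in range(num_gpus):
        shard = weights[gpu * rows_per_gpu:(gpu + 1) * rows_per_gpu]
        parts.append(matvec(shard, x))               # each GPU computes its slice
    return [y for part in parts for y in part]       # the "combine results" step

def pipeline_parallel(layers, x, num_gpus):
    """Give each GPU a contiguous run of whole layers and pass activations along."""
    layers_per_gpu = len(layers) // num_gpus         # assumes an even split
    for gpu in range(num_gpus):
        stage = layers[gpu * layers_per_gpu:(gpu + 1) * layers_per_gpu]
        for weights in stage:
            x = matvec(weights, x)                   # output feeds the next layer
    return x

# Four tiny 2x2 layers and an input vector, all made up for illustration.
layers = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
    [[2, 0], [0, 2]],
    [[1, 1], [1, -1]],
]
x = [1, 1]

# Unsharded reference: run every layer on one device.
reference = x
for w in layers:
    reference = matvec(w, reference)

# Tensor parallelism: every layer is split across 2 GPUs.
tp_out = x
for w in layers:
    tp_out = tensor_parallel_layer(w, tp_out, num_gpus=2)

# Pipeline parallelism: layers 1-2 on GPU 0, layers 3-4 on GPU 1.
pp_out = pipeline_parallel(layers, x, num_gpus=2)

print(reference, tp_out, pp_out)
```

In the tensor-parallel version the combine step runs once per layer, while the pipeline version hands data across GPUs only once per stage. That difference is where the communication patterns of the two approaches diverge.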
The communication cost
Sharding adds traffic between GPUs. Tensor parallelism needs frequent, fast exchanges within each layer, so it wants high-speed links. Pipeline parallelism passes data only where one GPU's layers hand off to the next GPU's, but it can leave GPUs idle while they wait for the previous stage.
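The idle time in a pipeline can be estimated with a simple counting argument. The sketch below assumes a basic fill-and-drain schedule in which every stage takes one equal time slot per batch, and it assumes the work is cut into several smaller batches (often called micro-batches, a term not used in this text) that enter the pipeline back to back. The function name is hypothetical.

```python
def pipeline_idle_fraction(num_stages, num_batches):
    """Fraction of time each GPU sits idle in a fill-and-drain pipeline.

    The last batch leaves the final stage after num_batches + num_stages - 1
    time slots, but each stage is busy for only num_batches of those slots.
    """
    total_slots = num_batches + num_stages - 1
    return (num_stages - 1) / total_slots

# One batch through four stages: each GPU works one slot out of four.
print(pipeline_idle_fraction(num_stages=4, num_batches=1))    # 0.75

# Twelve batches through four stages: the pipeline stays mostly full.
print(pipeline_idle_fraction(num_stages=4, num_batches=12))   # 0.2
```

With a single batch, three of the four GPUs are waiting at any moment. Feeding more batches in sequence shrinks the idle share, but the fill and drain phases never disappear entirely.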
Why it matters for serving
- It is the only way to serve a model that exceeds one GPU's memory.
- The interconnect speed often limits how fast a sharded model can respond.
- More shards mean more coordination, so latency does not simply drop with more GPUs.
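A toy model makes the last point concrete. Here the compute time divides evenly across GPUs, while each additional GPU adds a fixed coordination cost. Both the shape of the model and the numbers (80 ms of compute, 5 ms per extra GPU) are assumptions chosen for illustration, not measurements, and the function name is hypothetical.

```python
def toy_latency_ms(num_gpus, compute_ms=80.0, comm_ms_per_extra_gpu=5.0):
    """Toy latency: compute shrinks with more GPUs, coordination grows."""
    compute = compute_ms / num_gpus
    communication = comm_ms_per_extra_gpu * (num_gpus - 1)
    return compute + communication

for n in (1, 2, 4, 8):
    print(n, toy_latency_ms(n))
```

In this model the latency goes 80, 45, 35, then back up to 45 ms at eight GPUs: past a certain point the coordination cost outweighs the compute saved. A faster interconnect corresponds to a smaller per-GPU cost, which moves that turning point further out.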
Key idea
Sharding spreads a model across GPUs either by splitting each layer (tensor parallelism) or by assigning whole layers to different GPUs (pipeline parallelism). It unlocks models too large for one device but pays for that in cross-GPU communication, whose cost is governed by the interconnect speed.