Model Sharding Across GPUs

When one GPU is not enough

Large models hold billions of weights, which may not fit in a single GPU's memory. Sharding splits the model across multiple GPUs, so each one stores and computes a different piece and they produce the result together.
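As a rough illustration of when sharding becomes necessary, the sketch below estimates how many GPUs are needed just to hold the weights. The function name and the example figures (a 70-billion-parameter model, 2 bytes per weight, 80 GB of GPU memory) are assumptions chosen for illustration. The estimate ignores activations, caches, and framework overhead, so a real deployment would need more headroom.

```python
def min_gpus_for_weights(num_params, bytes_per_param, gpu_mem_bytes):
    """Smallest GPU count whose combined memory can hold the weights alone.

    Hypothetical helper: counts weight storage only and ignores
    activations, caches, and any framework overhead.
    """
    weight_bytes = num_params * bytes_per_param
    # Ceiling division using integers only, to avoid float rounding.
    return -(-weight_bytes // gpu_mem_bytes)


if __name__ == "__main__":
    gpu_mem = 80 * 10**9  # assumed 80 GB device
    # 70e9 weights * 2 bytes = 140 GB, which does not fit on one such device.
    print(min_gpus_for_weights(70_000_000_000, 2, gpu_mem))  # 2
    # 7e9 weights * 2 bytes = 14 GB, which fits on one device.
    print(min_gpus_for_weights(7_000_000_000, 2, gpu_mem))   # 1
```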

Two ways to split

Tensor parallelism splits individual layers, so each GPU computes part of every layer and they combine results.
Pipeline parallelism assigns whole layers to different GPUs, passing activations from one GPU to the next like stages on an assembly line.
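The two splits can be sketched in plain Python on a toy two-layer network, with ordinary lists standing in for GPUs. This is a minimal illustration, not how any particular framework implements sharding: the helper names are hypothetical, the column-wise split is just one possible way to divide a layer, and the "exchange" between devices is simulated by list concatenation or by passing a value to the next loop iteration.

```python
def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def split_columns(W, parts):
    """Cut W into `parts` contiguous column shards, one per simulated GPU."""
    cols = len(W[0])
    assert cols % parts == 0, "toy example needs an even split"
    step = cols // parts
    return [[row[k * step:(k + 1) * step] for row in W] for k in range(parts)]


def tensor_parallel_layer(x, W, parts):
    """Each simulated GPU computes its slice of the layer's output."""
    partials = [matvec(x, shard) for shard in split_columns(W, parts)]
    # Combining results: concatenate the slices (stands in for a gather).
    return [value for partial in partials for value in partial]


def pipeline_forward(x, stages):
    """Each stage is a list of whole layers owned by one simulated GPU."""
    for stage_layers in stages:
        for W in stage_layers:
            x = matvec(x, W)
        # At this point the activations would be sent to the next GPU.
    return x


W1 = [[1, 2, 3, 4], [5, 6, 7, 8]]        # layer 1: 2 inputs, 4 outputs
W2 = [[1, 0], [0, 1], [1, 1], [2, 0]]    # layer 2: 4 inputs, 2 outputs
x = [1, 2]

reference = matvec(matvec(x, W1), W2)    # unsharded, single device
tensor_out = tensor_parallel_layer(tensor_parallel_layer(x, W1, 2), W2, 2)
pipeline_out = pipeline_forward(x, [[W1], [W2]])

if __name__ == "__main__":
    print(reference, tensor_out, pipeline_out)
```

All three results match. In the tensor-parallel version both simulated GPUs take part in every layer, while in the pipeline version each one owns a whole layer and only hands activations onward.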

The communication cost

Sharding adds traffic between GPUs. Tensor parallelism needs frequent, fast exchanges within every layer, so it depends on high-speed links between the GPUs. Pipeline parallelism passes data only at the boundaries between stages, but a GPU can sit idle while it waits for the previous stage to finish.
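The idle time in a pipeline can be estimated with a small sketch. It assumes an idealized pipeline in which every stage takes exactly one time step and transfers between stages are free; the function name and the idea of counting "batches in flight" are illustrative assumptions, not something the text above specifies.

```python
def pipeline_idle_fraction(num_stages, num_batches):
    """Fraction of time each GPU sits idle in an idealized pipeline.

    Assumptions (hypothetical model): every stage takes one time step,
    transfers are free, and batches enter the pipeline back to back.
    """
    # The last batch leaves the last stage after this many steps.
    total_steps = num_batches + num_stages - 1
    # Each GPU does useful work for exactly one step per batch.
    busy_steps = num_batches
    return 1 - busy_steps / total_steps


if __name__ == "__main__":
    # A single batch through 4 stages: each GPU works 1 step out of 4.
    print(pipeline_idle_fraction(4, 1))  # 0.75
    # More batches in flight keep the stages busier.
    print(round(pipeline_idle_fraction(4, 8), 3))
```

Under these assumptions a lone request leaves each GPU idle most of the time, and the idle share shrinks only when several batches are moving through the stages at once.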

Why it matters for serving

Sharding is often the only practical way to serve a model whose weights exceed one GPU's memory.
The interconnect speed often limits how fast a sharded model can respond.
More shards mean more coordination, so latency does not simply drop with more GPUs.
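A toy latency model makes the last point concrete. Everything here is hypothetical: the function, the linear coordination cost, and the numbers (80 ms of compute, 5 ms of coordination per additional GPU) are invented for illustration and are not measurements of any real system.

```python
def toy_latency_ms(n_gpus, compute_ms=80.0, sync_ms_per_extra_gpu=5.0):
    """Toy model: compute divides evenly, coordination grows with GPU count.

    All parameters are made-up illustrative values, not measurements.
    """
    compute_share = compute_ms / n_gpus
    coordination = sync_ms_per_extra_gpu * (n_gpus - 1)
    return compute_share + coordination


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, toy_latency_ms(n))
```

With these invented numbers the latency falls from 1 to 4 GPUs and then rises again, because the coordination term eventually outweighs the shrinking compute share. A faster interconnect corresponds to a smaller coordination cost, which pushes that turning point toward higher GPU counts.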

Key idea

Sharding spreads a model across GPUs, either by splitting each layer across devices (tensor parallelism) or by assigning whole layers to different devices (pipeline parallelism). It unlocks models too large for one device but pays for that in cross-GPU communication, whose cost is governed by interconnect speed.
