Model Sharding Across GPUs

Split a model too big for one GPU across several of them.

When one GPU is not enough

Large models hold billions of weights, which may not fit in a single GPU's memory. Sharding splits the model across multiple GPUs so that each one holds and computes a different piece, and together they produce the full result.
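As a rough back-of-the-envelope check, the sketch below estimates how many GPUs are needed just to hold the weights. The function name and the example figures (70 billion weights, 2 bytes per weight, 80 GB per GPU) are illustrative assumptions, and the estimate ignores activations and any other runtime memory, so treat the result as a lower bound.

```python
import math

def min_gpus_for_weights(num_params: float, bytes_per_param: int, gpu_mem_gb: float) -> int:
    """Lower bound on the GPUs needed to hold the weights alone (hypothetical helper)."""
    weight_gb = num_params * bytes_per_param / 1e9  # total weight size in GB
    return math.ceil(weight_gb / gpu_mem_gb)

# Assumed example: 70 billion weights at 2 bytes each is 140 GB,
# which cannot fit on one 80 GB GPU.
print(min_gpus_for_weights(70e9, 2, 80))  # -> 2
```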

Two ways to split

  • Tensor parallelism splits individual layers, so each GPU computes part of every layer and they combine results.
  • Pipeline parallelism assigns whole layers to different GPUs, passing activations down the chain like a pipeline.
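The two splits can be sketched in plain Python, with ordinary function calls standing in for GPUs. This is a toy illustration with made-up weights, not how a real framework does it: the "devices" are simulated, and the combine step would be a collective operation over the interconnect in practice.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

# Two tiny layers with made-up weights.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0], [2.0, 0.0]]  # 4 outputs, 2 inputs
W2 = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 1.0, -1.0]]      # 2 outputs, 4 inputs
x = [1.0, 2.0]

# Reference: the whole model on one device.
reference = matvec(W2, relu(matvec(W1, x)))

# Tensor parallelism: every layer is split across both simulated GPUs.
def tensor_parallel_layer(W, v):
    half = len(W) // 2
    part0 = matvec(W[:half], v)  # GPU 0 computes the first half of the outputs
    part1 = matvec(W[half:], v)  # GPU 1 computes the second half
    return part0 + part1         # combine the partial results (list concatenation)

tp_out = tensor_parallel_layer(W2, relu(tensor_parallel_layer(W1, x)))

# Pipeline parallelism: GPU 0 owns layer 1, GPU 1 owns layer 2.
def stage0(v):
    return relu(matvec(W1, v))   # runs on GPU 0

def stage1(h):
    return matvec(W2, h)         # runs on GPU 1

pp_out = stage1(stage0(x))       # the activation is handed from GPU 0 to GPU 1

print(reference)                 # -> [-1.0, 5.5]
print(tp_out == reference, pp_out == reference)  # -> True True
```

Both splits give the same answer as the single-device run. What differs is where the work happens and how often the simulated GPUs must exchange data: once per layer for the tensor split, once per stage boundary for the pipeline split.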

The communication cost

Sharding adds traffic between GPUs. Tensor parallelism needs frequent, fast exchanges within each layer, so it wants high-speed links between GPUs. Pipeline parallelism passes data only at layer boundaries, but it can leave GPUs idle while they wait for the previous stage to finish.
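The idle time in a pipeline can be estimated with a simple model. The sketch below assumes every stage takes the same time and that the input may be cut into several micro-batches that flow through the stages one after another. Micro-batching is an assumption added here for illustration, and the function name is hypothetical.

```python
def pipeline_idle_fraction(stages: int, micro_batches: int = 1) -> float:
    """Fraction of time each GPU sits idle in a simple equal-stage pipeline."""
    total_steps = micro_batches + stages - 1  # steps until the last micro-batch exits
    return (stages - 1) / total_steps         # each GPU is busy for micro_batches steps

# Four stages: a single request keeps each GPU idle most of the time,
# while more micro-batches fill the gaps.
for m in (1, 4, 16):
    print(m, round(pipeline_idle_fraction(4, m), 2))
# -> 1 0.75
# -> 4 0.43
# -> 16 0.16
```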

Why it matters for serving

  • It lets you serve a model whose weights exceed one GPU's memory while keeping every weight in GPU memory.
  • The interconnect speed often limits how fast a sharded model can respond.
  • More shards mean more coordination, so latency does not simply drop with more GPUs.
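A toy latency model makes the last point concrete. It assumes compute time divides evenly across shards while each extra shard adds a fixed coordination cost. Both constants are made-up numbers, and real communication cost depends on the interconnect and on how the GPUs exchange data, so read this as an illustration of the trend only.

```python
def toy_latency_ms(shards: int, compute_ms: float = 80.0, sync_ms: float = 5.0) -> float:
    """Hypothetical model: compute shrinks with more shards, coordination grows."""
    return compute_ms / shards + sync_ms * (shards - 1)

for n in (1, 2, 4, 8, 16):
    print(n, toy_latency_ms(n))
# -> 1 80.0
# -> 2 45.0
# -> 4 35.0
# -> 8 45.0
# -> 16 80.0
```

Under these assumed numbers, latency is lowest at four shards and climbs again beyond that, because coordination grows faster than compute shrinks.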

Key idea

Sharding spreads a model across GPUs, either by splitting each layer (tensor parallelism) or by assigning whole layers to different GPUs (pipeline parallelism). It unlocks models too large for one device but pays for it in cross-GPU communication, whose cost is governed by interconnect speed.

Check yourself

1. What is the difference between tensor and pipeline parallelism?

2. Why does adding more shards not always reduce latency?