← Lessons

quiz vs the machine

Gold1460

Databases

The Distributed Joins Problem

Why joining tables across shards is so painful.

5 min read · core · beat Gold to climb

Joins Across the Network

On a single node a join reads two tables that already sit together. Once tables are sharded, the rows you need to join may live on different nodes, and joining them means shuffling data across the network.

Why It Hurts

A naive distributed join can move enormous data. To join orders with users when they are sharded differently, the system may ship matching rows between shards so they can be paired. This shuffle consumes network bandwidth and memory, and its cost grows with data size.

Techniques That Help

  • Co located joins Shard both tables by the same key, such as user id, so a user and their orders land on the same shard. The join then runs locally with no network shuffle.
  • Broadcast join Replicate a small table to every shard so each shard joins locally against its copy. This works only when one side is small.
  • Denormalization Fold the related data into one record up front so no join is needed at read time.

The Real Lesson

Distributed databases push designers to avoid cross shard joins rather than optimize them. The data model, especially the choice of shard key, decides whether joins stay local or turn into expensive shuffles.

Key idea

Cross shard joins force costly network shuffles; co locating by a shared shard key or denormalizing keeps joins local and cheap.

Check yourself

Answer to earn rating on the learn ladder.

1. Why are distributed joins expensive?

2. How does a co located join avoid a shuffle?

3. When is a broadcast join appropriate?