The Distributed Joins Problem

Joins Across the Network

On a single node a join reads two tables that already sit together. Once tables are sharded, the rows you need to join may live on different nodes, and joining them means shuffling data across the network.

Why It Hurts

A naive distributed join can move enormous data. To join orders with users when they are sharded differently, the system may ship matching rows between shards so they can be paired. This shuffle consumes network bandwidth and memory, and its cost grows with data size.

Techniques That Help

Co located joins Shard both tables by the same key, such as user id, so a user and their orders land on the same shard. The join then runs locally with no network shuffle.
Broadcast join Replicate a small table to every shard so each shard joins locally against its copy. This works only when one side is small.
Denormalization Fold the related data into one record up front so no join is needed at read time.

The Real Lesson

Distributed databases push designers to avoid cross shard joins rather than optimize them. The data model, especially the choice of shard key, decides whether joins stay local or turn into expensive shuffles.

Key idea

Cross shard joins force costly network shuffles; co locating by a shared shard key or denormalizing keeps joins local and cheap.

The Distributed Joins Problem

Joins Across the Network

Why It Hurts

Techniques That Help

The Real Lesson

Key idea

Check yourself