Joins Across the Network
On a single node a join reads two tables that already sit together. Once tables are sharded, the rows you need to join may live on different nodes, and joining them means shuffling data across the network.
Why It Hurts
A naive distributed join can move enormous data. To join orders with users when they are sharded differently, the system may ship matching rows between shards so they can be paired. This shuffle consumes network bandwidth and memory, and its cost grows with data size.
Techniques That Help
- Co located joins Shard both tables by the same key, such as user id, so a user and their orders land on the same shard. The join then runs locally with no network shuffle.
- Broadcast join Replicate a small table to every shard so each shard joins locally against its copy. This works only when one side is small.
- Denormalization Fold the related data into one record up front so no join is needed at read time.
The Real Lesson
Distributed databases push designers to avoid cross shard joins rather than optimize them. The data model, especially the choice of shard key, decides whether joins stay local or turn into expensive shuffles.
Key idea
Cross shard joins force costly network shuffles; co locating by a shared shard key or denormalizing keeps joins local and cheap.