Database Sharding

How to split one giant table across many machines when a single database can no longer keep up.

When one box is not enough

Sharding splits a dataset horizontally across multiple databases so each holds only a slice of the rows. It lets you grow past the limits of a single machine in storage, write throughput, and memory.

Each shard is a full database holding a disjoint subset of the data. A query for a given row goes only to the shard that owns it.
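
As a minimal sketch of that routing step, the snippet below models each shard as an in-memory dict and sends every read and write to exactly one of them. The shard names, the `shard_for`, `put`, and `get` helpers, and the choice of a SHA-256 hash are all assumptions made for illustration, not part of any particular database.

```python
import hashlib

# Hypothetical shard names; a real system would hold connections here.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

# Each shard is modelled as its own dict holding a disjoint set of rows.
stores = {name: {} for name in SHARDS}


def shard_for(key: str) -> str:
    """Map a shard key to exactly one shard using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


def put(key: str, row: dict) -> None:
    """Write the row to the single shard that owns the key."""
    stores[shard_for(key)][key] = row


def get(key: str):
    """Read the row from the owning shard only; other shards are not touched."""
    return stores[shard_for(key)].get(key)


put("user:42", {"name": "Ada"})
print(shard_for("user:42"), get("user:42"))
```

A stable hash is used here rather than Python's built-in `hash()`, because the built-in is randomized per process for strings and would route the same key differently after a restart.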

Choosing a shard key

The shard key decides which shard a row lands on, and choosing it is the hard part.

Range sharding splits by key ranges, which is great for range scans but can create hot spots: with a timestamp or auto-incrementing key, the newest rows all land on the same shard.
Hash sharding spreads keys evenly but scatters neighboring keys, so a range scan has to touch every shard.

A good key spreads both data and traffic evenly and matches how you query.
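
The trade-off between the two schemes can be sketched in a few lines. The split points, the shard count, and every function name below are illustrative assumptions; the point is only to show which shards a range scan has to visit under each scheme.

```python
import bisect
import hashlib

NUM_SHARDS = 4
# Assumed split points for range sharding:
# shard 0: id < 250, shard 1: 250..499, shard 2: 500..749, shard 3: id >= 750
SPLIT_POINTS = [250, 500, 750]


def range_shard(user_id: int) -> int:
    """Range sharding: the shard is found by locating the id among the split points."""
    return bisect.bisect_right(SPLIT_POINTS, user_id)


def hash_shard(user_id: int) -> int:
    """Hash sharding: a stable hash of the id, reduced modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shards_for_scan_range(lo: int, hi: int) -> list[int]:
    """Under range sharding, a scan of lo..hi visits only the shards covering that span."""
    return list(range(range_shard(lo), range_shard(hi) + 1))


def shards_for_scan_hash(lo: int, hi: int) -> list[int]:
    """Under hash sharding, neighboring ids land anywhere, so every shard is asked."""
    return list(range(NUM_SHARDS))


print(shards_for_scan_range(300, 450))  # a narrow scan stays on one shard
print(shards_for_scan_hash(300, 450))   # the same scan fans out to all shards

# The hot-spot side of the trade-off: the newest ids all map to the last range shard.
print({range_shard(i) for i in range(900, 1000)})
print({hash_shard(i) for i in range(900, 1000)})
```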

Costs you accept

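Sharding complicates life. Queries that span shards need fan-out and merge: the query is sent to every relevant shard and the partial results are combined. Cross-shard transactions are hard because no single database can commit them atomically. Rebalancing data when you add shards takes care, which is why sharding is often paired with consistent hashing, which limits how many keys have to move.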
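
To make the rebalancing cost concrete, here is a rough sketch comparing plain modulo placement with a simple consistent-hash ring when a fifth shard is added. The `HashRing` class, the shard names, and the choice of 100 virtual nodes per shard are assumptions for illustration, not a production design.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash taken from SHA-256."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


def modulo_shard(key: str, shards: list[str]) -> str:
    """Naive placement: hash modulo the number of shards."""
    return shards[_hash(key) % len(shards)]


class HashRing:
    """Consistent hashing: shards own points on a ring; a key goes to the next point."""

    def __init__(self, shards: list[str], vnodes: int = 100):
        # Each shard gets several virtual nodes so load spreads more evenly.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First ring point past the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(10_000)]
old_shards = ["s0", "s1", "s2", "s3"]
new_shards = old_shards + ["s4"]

moved_modulo = sum(modulo_shard(k, old_shards) != modulo_shard(k, new_shards) for k in keys)

old_ring, new_ring = HashRing(old_shards), HashRing(new_shards)
moved_ring = sum(old_ring.shard_for(k) != new_ring.shard_for(k) for k in keys)

print(f"modulo placement moved {moved_modulo} of {len(keys)} keys")
print(f"consistent hashing moved {moved_ring} of {len(keys)} keys")
```

With modulo placement most keys change owner when the shard count goes from four to five, while the ring moves only the keys the new shard takes over, on the order of one fifth of them. Keys that stay put never need to be copied between the existing shards.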

Key idea

Sharding scales writes by splitting rows across machines, at the cost of harder cross-shard queries and transactions.
