When one box is not enough
Sharding splits a dataset horizontally across multiple databases so each holds only a slice of the rows. It lets you grow past the limits of a single machine in storage, write throughput, and memory.
Each shard is a full database holding a disjoint subset of the data. A query for a given row goes only to the shard that owns it.
Choosing a shard key
The shard key decides which shard a row lands on, and choosing it is the hard part.
- Range sharding splits by key ranges, which is great for range scans but can create hot spots if recent data clusters together.
- Hash sharding spreads keys evenly but makes range scans touch every shard.
A good key spreads both data and traffic evenly and matches how you query.
Costs you accept
Sharding complicates life. Queries that span shards need fan out and merge. Cross shard transactions are hard. Rebalancing data when you add shards takes care, which is why consistent hashing is often paired with it.
Key idea
Sharding scales writes by splitting rows across machines, at the cost of harder cross shard queries.