Repos are the unit
GitHub hosts millions of git repositories. Each repo is a bundle of objects, and the traffic is heavily skewed: a few popular repos get enormous read traffic while most are quiet.
Sharding repositories
A single server cannot hold every repo, so repos are sharded across many file servers. A routing layer maps a repo to its home server, and that server handles its git operations.
- Repos are distributed across storage servers
- A routing layer maps repo to server
- Replicas provide redundancy and read capacity
Caching the hot path
Web views like the file browser and pull request pages are read heavy. These rendered views and frequently fetched objects are cached so a viral repo does not overload its storage server.
Scale comes from spreading repos across servers, and stability comes from caching the small set of very hot reads.
Key idea
Shard repositories across storage servers behind a routing layer, and cache the heavily read rendered views so popular repos do not overwhelm a single server.