← Lessons

quiz vs the machine

Gold1450

System Design

The GitHub Code Hosting

Serving millions of repos means sharding git storage and caching hot reads.

5 min read · core · beat Gold to climb

Repos are the unit

GitHub hosts millions of git repositories. Each repo is a bundle of objects, and the traffic is heavily skewed: a few popular repos get enormous read traffic while most are quiet.

Sharding repositories

A single server cannot hold every repo, so repos are sharded across many file servers. A routing layer maps a repo to its home server, and that server handles its git operations.

  • Repos are distributed across storage servers
  • A routing layer maps repo to server
  • Replicas provide redundancy and read capacity

Caching the hot path

Web views like the file browser and pull request pages are read heavy. These rendered views and frequently fetched objects are cached so a viral repo does not overload its storage server.

Scale comes from spreading repos across servers, and stability comes from caching the small set of very hot reads.

Key idea

Shard repositories across storage servers behind a routing layer, and cache the heavily read rendered views so popular repos do not overwhelm a single server.

Check yourself

Answer to earn rating on the learn ladder.

1. Why are repositories sharded across servers?

2. What problem does caching rendered views address?