Requirements
- Limit each client to a fixed number of requests per time window.
- Decisions must be fast and consistent across many API servers.
- Reject excess requests cleanly with a retry hint.
High level design
A limiter sits in front of the service, keying on user id or IP. The common algorithms trade accuracy for memory.
- Token bucket: tokens refill at a steady rate and each request spends one, allowing bursts up to the bucket size.
- Sliding window: counts requests in the recent window for smoother limits without edge spikes.
- Fixed window: a simple counter per window that is cheap but allows double bursts at boundaries.
A shared store such as a fast in memory cache holds the counters so every server sees the same state.
Bottlenecks
- Shared counter latency: every request reads and writes the store, so colocate the cache and use atomic increments.
- Race conditions: read then write is not atomic, so use a single atomic operation or a small script that runs server side.
- Distributed accuracy: local per node limits drift, so a central store keeps the global count honest at the cost of one network hop.
Return a clear status code with a retry after header so clients back off instead of hammering.
Key idea
A rate limiter is a fast shared counter problem, and the algorithm choice balances burst tolerance against memory and accuracy.