Why limit rates
A rate limiter caps how many requests a client may make in a given time window, protecting a service from abuse, runaway clients, and accidental floods. It sits in front of the real work and rejects or delays excess traffic.
The token bucket model
The common design is a token bucket per client:
- Tokens refill at a fixed rate, say ten per second, up to a cap.
- Each request consumes one token.
- If the bucket holds at least one token, allow the request. If it is empty, reject the request or queue it.
The cap allows short bursts, while the refill rate bounds the long-run average. Other models include fixed window, sliding window, and leaky bucket.
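The bucket rules above can be sketched in a few lines of Python. This is a minimal single-process illustration, not a production limiter: the class name TokenBucket, its parameters, and the optional now argument (used only to make the behaviour easy to check) are assumptions made for this example. Tokens are refilled lazily on each call from the elapsed time rather than by a background timer.

```python
import time


class TokenBucket:
    """Illustrative token bucket for one client (names are hypothetical)."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # cap on stored tokens, i.e. the burst size
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True and consume one token if the request may proceed."""
        now = time.monotonic() if now is None else now
        # Refill lazily: add tokens for the time elapsed since the last call,
        # never exceeding the cap.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client: ten tokens per second, bursts of up to five.
buckets = {}


def allow_client(client_id, now=None):
    bucket = buckets.setdefault(client_id, TokenBucket(rate=10, capacity=5, now=now))
    return bucket.allow(now=now)
```

With a rate of ten and a cap of five, a client that has been idle can send five requests back to back, after which requests are admitted only as tokens trickle back in.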
Where it runs
- A limiter at the edge or gateway blocks abuse before it reaches services.
- Counters live in a fast shared store so limits hold across many servers, not just one.
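To show why the counter has to be shared, here is a rough sketch in which several servers consult one store. For brevity it uses a fixed window counter (one of the other models mentioned earlier) rather than a token bucket. InMemoryStore is a hypothetical stand-in for a real shared store and only assumes that store offers an atomic increment; allow_request, the key format, and the default limits are likewise invented for this example.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for a fast shared store with an atomic increment."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def incr(self, key):
        # The lock makes increment-and-read atomic, which is the property
        # the limiter relies on when many servers update the same counter.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]


def allow_request(store, client_id, now, limit=10, window_seconds=1):
    """Fixed window check: at most `limit` requests per client per window."""
    window = int(now // window_seconds)
    key = f"rate:{client_id}:{window}"   # one counter per client per window
    return store.incr(key) <= limit


# Two "servers" share one store, so the limit applies to their combined traffic.
store = InMemoryStore()
server_a = lambda client, now: allow_request(store, client, now, limit=3)
server_b = lambda client, now: allow_request(store, client, now, limit=3)
```

If each server kept its own counter, a client could multiply its allowance by spreading requests across servers. A real deployment would also need old window keys to expire, which this sketch leaves out.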
What to return
- A rejected request gets a clear status, typically HTTP 429 Too Many Requests, with a retry hint such as a Retry-After header.
- Limits are keyed per client: by API key, user ID, or IP address.
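As a small illustration of the two points above, assuming the service speaks HTTP, the sketch below picks a limiter key and builds a rejection. The function names, the key prefixes, and the fallback order (API key, then user ID, then IP) are assumptions for this example, not a prescribed scheme. The status code 429 and the Retry-After header are standard HTTP.

```python
import math


def client_key(api_key=None, user_id=None, ip=None):
    """Pick the most specific identifier available (order is an assumption)."""
    if api_key:
        return f"key:{api_key}"
    if user_id:
        return f"user:{user_id}"
    return f"ip:{ip}"


def rejection_response(retry_after_seconds):
    """Build a (status, headers, body) triple for a rate-limited request."""
    # Retry-After takes whole seconds here; round up so the client
    # does not retry before a token is actually available.
    seconds = max(1, math.ceil(retry_after_seconds))
    headers = {
        "Retry-After": str(seconds),
        "Content-Type": "text/plain",
    }
    return 429, headers, "Too Many Requests\n"
```

A handler would call client_key to choose which counter or bucket to check, and return rejection_response when that check fails.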
Key idea
A rate limiter caps client request rates, commonly with a token bucket that refills at a fixed rate and allows bursts up to a cap, using a shared counter so limits hold across servers.