Why limit at all
Rate limiting caps how many requests a client may make in a given time window. It protects a service from abuse, runaway clients, and accidental traffic spikes, and it shares capacity fairly across many users.
When a client exceeds its limit, the server typically returns a status meaning too many requests (in HTTP, 429 Too Many Requests), often with a hint such as a Retry-After header telling the client when to retry.
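A minimal sketch of such a response in HTTP terms, assuming status 429 Too Many Requests and a Retry-After header as the retry hint. The function name `too_many_requests` and the (status, headers, body) return shape are illustrative choices, not tied to any particular framework.

```python
import math
from http import HTTPStatus


def too_many_requests(retry_after_seconds: float) -> tuple[int, dict[str, str], str]:
    """Build a (status, headers, body) triple for a throttled request."""
    # Retry-After is given here in whole seconds; round up so the
    # client is never told to retry before capacity is available.
    delay = max(1, math.ceil(retry_after_seconds))
    headers = {
        "Retry-After": str(delay),
        "Content-Type": "text/plain",
    }
    return HTTPStatus.TOO_MANY_REQUESTS.value, headers, "Too Many Requests\n"
```

A well-behaved client would wait at least the advertised number of seconds before sending the next request.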
The token bucket
One of the most widely used algorithms is the token bucket. A bucket holds tokens up to a fixed capacity and refills at a steady rate. Each request removes one token. If the bucket is empty, the request is rejected or delayed.
- Capacity sets how large a burst is allowed.
- Refill rate sets the sustained throughput over time.
This lets short bursts through while still bounding the long-run average: over any interval of t seconds, at most capacity plus refill rate times t requests are admitted.
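The token bucket described above can be sketched in a few lines. This is one possible single-process version, not a definitive implementation: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are all assumptions made for illustration, and tokens are refilled lazily on each call rather than by a background timer.

```python
import time


class TokenBucket:
    """Single-process token bucket with lazy refill (illustrative sketch)."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock              # injectable so tests can control time
        self.tokens = capacity          # start with a full bucket
        self.last = clock()

    def _refill(self) -> None:
        # Add the tokens earned since the last call, never exceeding capacity.
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if enough are available."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `TokenBucket(capacity=3, refill_rate=1.0)`, three back-to-back requests pass, the fourth is rejected, and one more request is admitted for each second that elapses afterwards. This sketch rejects on an empty bucket; delaying the request instead would mean sleeping until enough tokens have accumulated.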
Other approaches
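- Fixed window counts requests per clock interval, but a client can send a full quota at the end of one interval and another at the start of the next, so up to double the limit can pass in a short span around the boundary.
- Sliding window smooths that boundary by adding in the previous window's count, weighted by how much of that window still overlaps the interval being measured.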
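A rough sketch of the sliding window idea, in the common "weighted counter" form: the estimate is the previous window's count scaled by the fraction of it still in view, plus the current window's count. The class name `SlidingWindowCounter` and the injectable `clock` are assumptions for illustration, and other sliding window variants (for example, keeping a log of timestamps) exist.

```python
import time


class SlidingWindowCounter:
    """Sliding window limiter using a weighted previous-window count (sketch)."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit      # maximum requests per window
        self.window = window    # window length in seconds
        self.clock = clock      # injectable so tests can control time
        self.index = int(clock() // window)  # which fixed window we are in
        self.current = 0        # requests admitted in the current window
        self.previous = 0       # requests admitted in the window before it

    def allow(self) -> bool:
        now = self.clock()
        index = int(now // self.window)
        if index != self.index:
            # Roll over. The old count only matters if it belongs to the
            # window immediately before the new one.
            self.previous = self.current if index == self.index + 1 else 0
            self.current = 0
            self.index = index
        # Fraction of the current window that has already elapsed.
        fraction = (now % self.window) / self.window
        estimate = self.previous * (1.0 - fraction) + self.current
        if estimate < self.limit:
            self.current += 1
            return True
        return False
```

Right after a boundary the previous window still carries nearly full weight, so a client that just used its whole quota cannot immediately send a second full quota, which is the behavior a plain fixed window would allow.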
In a distributed setup, the counter usually lives in a shared store so that all servers agree on one tally per client.
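To illustrate the shared tally, here is a sketch in which two limiter instances, standing in for two servers, count against the same store. Everything here is hypothetical: `InMemoryStore` is an in-process stand-in for a real networked store, its `incr` method is an assumed interface (an atomic increment that returns the new value), and `SharedFixedWindowLimiter` uses a simple fixed window for brevity.

```python
import threading
import time


class InMemoryStore:
    """In-process stand-in for a shared store with an atomic increment."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts: dict[str, int] = {}

    def incr(self, key: str) -> int:
        # Increment and read back under one lock so concurrent callers
        # never observe the same value twice.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]


class SharedFixedWindowLimiter:
    """Fixed window limiter whose counters live in a shared store (sketch)."""

    def __init__(self, store, limit: int, window: float, clock=time.time):
        self.store = store
        self.limit = limit
        self.window = window
        # Wall-clock time is assumed so that servers with roughly
        # synchronized clocks compute the same window index.
        self.clock = clock

    def allow(self, client_id: str) -> bool:
        window_index = int(self.clock() // self.window)
        # One key per client per window. This sketch never expires old
        # keys; a real store would need some form of expiry.
        key = f"{client_id}:{window_index}"
        return self.store.incr(key) <= self.limit
```

Because both instances increment the same key, a client that alternates between servers is still held to a single limit rather than getting one quota per server.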
Key idea
A token bucket allows bursts up to a cap while bounding the steady rate over time.