Protecting the Downstream
A queue can hand workers far more jobs than a downstream API allows. If workers drain the queue at full speed, they trip the third-party rate limit, get errors back, and trigger retries that add even more load. A rate-limited consumer caps how fast it processes jobs.
Token Bucket
The common mechanism is a token bucket. Tokens refill at the allowed rate, say ten per second, up to a fixed bucket size. A worker must take a token before processing a job; when the bucket is empty, workers wait for the next refill. This permits short bursts up to the bucket size while holding the long-run average at the limit.
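The mechanism described above can be sketched in a few lines. This is a minimal single-process illustration, not a production limiter: the class name `TokenBucket`, its parameters `rate` and `capacity`, and the injectable `clock` are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second and holds at most `capacity` tokens."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the bucket can be tested
        self.tokens = capacity        # start full, so an initial burst is allowed
        self.updated = clock()

    def _refill(self):
        # Add the tokens earned since the last update, capped at the bucket size.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return True on success."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self):
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        if self.tokens >= 1:
            return 0.0
        return (1 - self.tokens) / self.rate

    def acquire(self, sleep=time.sleep):
        """Block until a token is available, then take it."""
        while not self.try_acquire():
            sleep(self.wait_time())
```

A worker would call `bucket.acquire()` before each job. With `rate=10` and `capacity=5`, up to five jobs can run back to back, after which the worker is held to roughly ten jobs per second.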
Shared Limits Across Workers
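A single worker can throttle itself locally, but many workers share one downstream limit. The bucket must therefore be shared, typically in a central store that every worker consults, so that the pool as a whole stays under the cap. Otherwise N workers, each with its own local bucket, could together send up to N times the allowed rate.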
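The sketch below illustrates the shared-bucket idea with threads standing in for workers. The `CentralStore` class is a hypothetical in-process stand-in for whatever central store a real deployment would use; the important property it models is that the refill-and-take step is atomic, so two workers can never spend the same token. The key name `"downstream-api"` and the rate values are also assumptions.

```python
import threading
import time


class CentralStore:
    """In-process stand-in for a shared store; take() is atomic under a lock."""

    def __init__(self, clock=time.monotonic):
        self._lock = threading.Lock()
        self._state = {}              # key -> (tokens, last_update)
        self._clock = clock

    def take(self, key, rate, capacity):
        """Refill the bucket stored under `key` and try to take one token."""
        with self._lock:
            now = self._clock()
            tokens, updated = self._state.get(key, (capacity, now))
            tokens = min(capacity, tokens + (now - updated) * rate)
            allowed = tokens >= 1
            if allowed:
                tokens -= 1
            self._state[key] = (tokens, now)
            return allowed


def worker(store, jobs, processed):
    """Process jobs, waiting on the shared bucket before each one."""
    for job in jobs:
        while not store.take("downstream-api", rate=10, capacity=5):
            time.sleep(0.01)          # bucket empty: wait and try again
        processed.append(job)         # placeholder for the real downstream call
```

Because every worker draws from the same key, adding more workers does not raise the total request rate; it only changes which worker gets the next token.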
Per-Tenant Fairness
If one customer floods the queue, that customer can consume the whole rate budget and starve the others. Use a bucket per tenant so each customer gets a fair share of throughput.
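One way to picture per-tenant buckets is a limiter that keeps a separate token count for each tenant. The following is a minimal sketch under assumed names (`PerTenantLimiter`, `tenant_id`); it gives every tenant the same rate and capacity, which is only one possible fairness policy.

```python
import time


class PerTenantLimiter:
    """Keeps an independent token bucket for each tenant."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens per second, per tenant
        self.capacity = capacity      # burst size, per tenant
        self.clock = clock
        self._buckets = {}            # tenant_id -> [tokens, last_update]

    def try_acquire(self, tenant_id):
        """Take one token from this tenant's bucket; return True on success."""
        now = self.clock()
        # A tenant seen for the first time starts with a full bucket.
        bucket = self._buckets.setdefault(tenant_id, [self.capacity, now])
        tokens, updated = bucket
        tokens = min(self.capacity, tokens + (now - updated) * self.rate)
        allowed = tokens >= 1
        bucket[0] = tokens - 1 if allowed else tokens
        bucket[1] = now
        return allowed
```

A flood from one tenant drains only that tenant's bucket, so jobs from other tenants still get tokens. In practice the per-tenant rates would presumably be chosen so that their sum fits within the downstream limit, or the per-tenant check would be combined with the shared bucket.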
Reacting to the Downstream
Respect signals from the downstream. If it returns a slow-down response with a retry-after hint (for example, HTTP 429 Too Many Requests with a Retry-After header), back off for that duration rather than guessing.
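As an illustration, assume the downstream speaks HTTP and sends its hint in a `Retry-After` header, which may carry either a number of seconds or an HTTP date. The helper below is a sketch under those assumptions: the function name `retry_after_seconds`, the `default` fallback, and the plain-dict `headers` argument (looked up with the exact key `"Retry-After"`) are all hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, now=None, default=1.0):
    """Return how long to back off, based on a Retry-After header if present."""
    value = headers.get("Retry-After")
    if value is None:
        return default                      # no hint: fall back to a default
    value = value.strip()
    if value.isdigit():
        return float(value)                 # delta-seconds form, e.g. "30"
    try:
        when = parsedate_to_datetime(value) # HTTP-date form
    except (TypeError, ValueError):
        return default                      # unparseable hint: use the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is no need to wait any longer.
    return max(0.0, (when - now).total_seconds())
```

A worker that receives the slow-down response would sleep for the returned number of seconds before taking its next token, instead of retrying immediately.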
Key idea
A rate-limited consumer uses a shared token bucket so the whole worker pool stays under a downstream limit, with per-tenant buckets for fairness.