Rate Limiting
Control how frequently clients can perform actions so systems remain available, fair, and protected from accidental or abusive overload.
Concepts Covered
- Quotas
- Token bucket
- Leaky bucket
- Abuse prevention
- Fairness
- Load shedding
- Distributed rate limits
- Retry behavior
Definition
Rate limiting restricts how often a caller can perform an action.
The caller might be:
- a user
- an IP address
- an API key
- an organization
- a device
- another internal service
The action might be creating a short link, sending a message, attempting login, liking a post, calling an API, or uploading media.
Rate limiting exists to keep systems available when some callers send more traffic than the system should accept.
The Pain That Forces Rate Limiting
Without rate limits, a single caller can consume capacity that every other caller depends on.
Example:
1. One client sends 20,000 create-link requests per minute.
2. API workers spend time on that client.
3. Database writes increase.
4. Abuse checks and analytics queues fill.
5. Normal users experience slow requests.
The system may be healthy for normal usage, but unhealthy under unfair usage.
Rate limiting gives the system a boundary:
This caller can perform this action this many times in this period.
It protects reliability, cost, abuse surfaces, and fairness.
Mental Model
Rate limiting is controlled refusal.
It is not a failure of the system. It is the system intentionally saying:
Accepting this request would harm reliability or fairness.
Good rate limiting is specific. It limits the smallest useful identity and action. A public API might limit by API key and endpoint. A login system might limit by account and IP. A messaging system might limit sends per user, per conversation, and per device.
Bad rate limiting is too blunt. It blocks legitimate users, hides real capacity problems, or creates confusing retry behavior.
Common Algorithms
| Algorithm | Intuition | Good for |
|---|---|---|
| Fixed window | Count requests in a fixed time window | Simple quotas |
| Sliding window | Smooth counts across window boundaries | Fairer API limits |
| Token bucket | Tokens refill over time and requests spend tokens | Allowing controlled bursts |
| Leaky bucket | Requests drain at a steady rate | Smoothing traffic |
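The simplest entry in the table, a fixed window counter, can be sketched in a few lines (class and parameter names are illustrative, not from any particular library):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per caller in each fixed time window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (caller, window index) -> request count

    def allow(self, caller: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # All requests in the same window share one counter key.
        key = (caller, int(now // self.window))
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

The weakness that motivates sliding windows shows up at the boundary: a caller can spend its full limit at the end of one window and again at the start of the next, doubling its effective rate for a short period.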
Token bucket is common because it supports bursts without allowing unlimited sustained traffic.
Example:
bucket capacity: 100 tokens
refill rate: 10 tokens per second
request cost: 1 token
If the caller has tokens, the request proceeds. If the bucket is empty, the request is rejected or delayed.
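The numbers above translate into a small sketch (a minimal single-process version; names are illustrative):

```python
import time

class TokenBucket:
    """Token bucket with the example parameters: 100-token capacity,
    refill of 10 tokens per second, and a default cost of 1 token."""

    def __init__(self, capacity: float = 100, refill_rate: float = 10):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill based on elapsed time, capped at the bucket's capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With these parameters, a burst of 100 requests is accepted immediately, but sustained traffic above 10 requests per second drains the bucket and gets rejected.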
What Happens When Limited
When a request exceeds the limit, the system can:
- reject it with 429 Too Many Requests
- return a Retry-After header
- delay it in a queue
- degrade the response
- require stronger verification
- silently drop low-value background work
For user-facing APIs, explicit responses are usually better. Clients need to know whether they should retry, wait, or stop.
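For a token bucket, one way to fill in the Retry-After header is to compute how long the deficit takes to refill (a sketch under the assumption that the limiter exposes its current token count and refill rate):

```python
import math

def retry_after_seconds(tokens: float, cost: float, refill_rate: float) -> int:
    """Seconds until a bucket holding `tokens` can afford `cost`,
    refilling at `refill_rate` tokens per second. Rounded up because
    Retry-After headers carry whole seconds."""
    deficit = max(0.0, cost - tokens)
    return math.ceil(deficit / refill_rate)
```

An empty bucket refilling at 10 tokens per second needs 0.1 s to afford a 1-token request, which rounds up to a Retry-After of 1 second.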
For internal systems, rate limits can act like backpressure. A downstream service that is overloaded can force callers to slow down instead of letting retries create a collapse.
Distributed Rate Limits
Rate limiting is simple on one server and harder across many servers.
If every API server keeps its own counter, a caller can exceed the intended global limit by spreading requests across servers.
server A allows 100/min
server B allows 100/min
server C allows 100/min
actual global usage: 300/min
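The gap between local and global enforcement is easy to demonstrate with a small simulation (illustrative code, not a production design):

```python
class LocalCounter:
    """A naive per-server counter: each server enforces 100/min on its own."""

    def __init__(self, limit: int = 100):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

# One caller spreads 300 requests across three servers round-robin.
servers = [LocalCounter(limit=100) for _ in range(3)]
allowed = sum(servers[i % 3].allow() for i in range(300))
# Every server stays under its local limit, yet the caller pushed
# 300 requests through in the same minute: 3x the intended global limit.
```

A single shared counter with the same limit would have stopped the caller at 100, which is why global enforcement usually routes through a shared store or a dedicated limiter service.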
To enforce global limits, systems often use shared stores such as Redis, local approximations with periodic sync, regional limiters, or dedicated rate limit services.
The tradeoff is latency and availability. A central limiter can be accurate, but it can also become a dependency on the critical path.
Design Questions
Important questions:
- What identity is being limited?
- What action is being limited?
- Should limits allow bursts?
- Is the limit global, regional, per tenant, or per server?
- What response does the client receive?
- Are limits different for trusted users, paid users, or internal jobs?
- What happens if the rate limiter is unavailable?
- How do clients avoid retry storms after being limited?
Rate limits are product decisions as much as infrastructure decisions.
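On the retry-storm question, a widely used client-side answer is exponential backoff with full jitter: double the wait ceiling after each failure, then pick a random delay under it so limited clients do not all retry at the same instant. A minimal sketch (parameter names are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: wait a random amount between
    0 and min(cap, base * 2**attempt) seconds. The randomness spreads
    retries out instead of synchronizing them into a new traffic spike."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

When the server also returns a Retry-After value, clients should treat it as a floor on the computed delay rather than ignoring it.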
Operational Reality
Watch:
- requests allowed versus denied
- top limited identities
- limiter latency
- limiter error rate
- false positives from legitimate users
- retry traffic after 429
- abuse patterns by endpoint
- downstream saturation avoided by limits
Rate limiting is not just about saying no. It is about preserving system health so legitimate work can continue.