Rate Limiting

Control how frequently clients can perform actions so systems remain available, fair, and protected from accidental or abusive overload.


Concepts Covered

  • Quotas
  • Token bucket
  • Leaky bucket
  • Abuse prevention
  • Fairness
  • Load shedding
  • Distributed rate limits
  • Retry behavior

Definition

Rate limiting restricts how often a caller can perform an action.

The caller might be:

  • a user
  • an IP address
  • an API key
  • an organization
  • a device
  • another internal service

The action might be creating a short link, sending a message, attempting login, liking a post, calling an API, or uploading media.

Rate limiting exists to keep systems available when some callers send more traffic than the system should accept.

The Pain That Forces Rate Limiting

Without rate limits, one caller can consume shared capacity.

Example:

1. One client sends 20,000 create-link requests per minute.
2. API workers spend time on that client.
3. Database writes increase.
4. Abuse checks and analytics queues fill.
5. Normal users experience slow requests.

The system may be healthy for normal usage, but unhealthy under unfair usage.

Rate limiting gives the system a boundary:

This caller can perform this action this many times in this period.

It protects reliability, cost, abuse surfaces, and fairness.

Mental Model

Rate limiting is controlled refusal.

It is not a failure of the system. It is the system intentionally saying:

Accepting this request would harm reliability or fairness.

Good rate limiting is specific. It limits the smallest useful identity and action. A public API might limit by API key and endpoint. A login system might limit by account and IP. A messaging system might limit sends per user, per conversation, and per device.

Bad rate limiting is too blunt. It blocks legitimate users, hides real capacity problems, or creates confusing retry behavior.

Common Algorithms

Algorithm        Intuition                                           Good for
Fixed window     Count requests in a fixed time window               Simple quotas
Sliding window   Smooth counts across window boundaries              Fairer API limits
Token bucket     Tokens refill over time and requests spend tokens   Allowing controlled bursts
Leaky bucket     Requests drain at a steady rate                     Smoothing traffic

Token bucket is common because it supports bursts without allowing unlimited sustained traffic.

Example:

bucket capacity: 100 tokens
refill rate: 10 tokens per second
request cost: 1 token

If the caller has tokens, the request proceeds. If the bucket is empty, the request is rejected or delayed.
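Those numbers can be turned into a minimal sketch. The class name and the injectable clock below are illustrative choices, not part of any particular library:

```python
import time

class TokenBucket:
    """Token bucket limiter: capacity caps burst size, refill_rate caps sustained rate."""

    def __init__(self, capacity, refill_rate, now=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = float(capacity)    # start full so an initial burst is allowed
        self.now = now                   # injectable clock, handy for testing
        self.last = now()

    def allow(self, cost=1):
        t = self.now()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Deterministic demo using a manual clock.
clock = [0.0]
bucket = TokenBucket(capacity=100, refill_rate=10, now=lambda: clock[0])
burst = [bucket.allow() for _ in range(150)]  # burst of 150 at t=0
# Exactly the first 100 pass; the bucket is then empty.
clock[0] = 5.0  # five seconds later, 50 tokens have refilled
```

This shows the defining tradeoff: the caller can spend up to 100 requests instantly, but sustained throughput is bounded by the 10/second refill.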

What Happens When Limited

When a request exceeds the limit, the system can:

  • reject it with 429 Too Many Requests
  • return a Retry-After header
  • delay it in a queue
  • degrade the response
  • require stronger verification
  • silently drop low-value background work

For user-facing APIs, explicit responses are usually better. Clients need to know whether they should retry, wait, or stop.
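One way a client can make that decision is to honor Retry-After when the server sends it, and otherwise back off with jitter. A sketch under those assumptions (the function name and defaults are illustrative):

```python
import random

def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds a client should wait after a 429 before retrying.

    Honors the server's Retry-After value when present; otherwise uses
    exponential backoff with full jitter so limited clients do not all
    retry at the same instant and recreate the overload.
    """
    if retry_after is not None:
        return float(retry_after)
    # Full jitter: a random delay in [0, min(cap, base * 2**attempt)].
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Jitter matters here: without it, every client limited at the same moment retries at the same moment, turning one spike into a repeating one.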

For internal systems, rate limits can act like backpressure. A downstream service that is overloaded can force callers to slow down instead of letting retries create a collapse.

Distributed Rate Limits

Rate limiting is simple on one server and harder across many servers.

If every API server keeps its own counter, a caller can exceed the intended global limit by spreading requests across servers.

server A allows 100/min
server B allows 100/min
server C allows 100/min
actual global usage: 300/min

To enforce global limits, systems often use shared stores such as Redis, local approximations with periodic sync, regional limiters, or dedicated rate limit services.

The tradeoff is latency and availability. A central limiter can be accurate, but it can also become a dependency on the critical path.
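The shared-store approach can be made concrete with a fixed-window counter keyed by caller and window. Here a plain dict stands in for Redis, where the increment would be an atomic INCR with an expiry; the class name is illustrative:

```python
import time

class SharedWindowLimiter:
    """Fixed-window global limiter over a shared store.

    `store` is a dict here for illustration; in production it would be a
    shared store such as Redis, with the increment done atomically.
    """

    def __init__(self, store, limit, window_seconds):
        self.store = store
        self.limit = limit
        self.window = window_seconds

    def allow(self, caller, now=None):
        now = time.time() if now is None else now
        key = (caller, int(now // self.window))  # one counter per caller per window
        count = self.store.get(key, 0) + 1       # atomic INCR in a real shared store
        self.store[key] = count
        return count <= self.limit

store = {}
limiter = SharedWindowLimiter(store, limit=100, window_seconds=60)
# Every API server sharing `store` sees the same counter, so 300 requests
# spread across three servers still collapse to one 100/min global limit.
```

Because all servers read and write the same counter, the per-server loophole from the example above disappears; the price is a network round trip to the shared store on every request.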

Design Questions

Important questions:

  • What identity is being limited?
  • What action is being limited?
  • Should limits allow bursts?
  • Is the limit global, regional, per tenant, or per server?
  • What response does the client receive?
  • Are limits different for trusted users, paid users, or internal jobs?
  • What happens if the rate limiter is unavailable?
  • How do clients avoid retry storms after being limited?

Rate limits are product decisions as much as infrastructure decisions.

Operational Reality

Watch:

  • requests allowed versus denied
  • top limited identities
  • limiter latency
  • limiter error rate
  • false positives from legitimate users
  • retry traffic after 429
  • abuse patterns by endpoint
  • downstream saturation avoided by limits

Rate limiting is not just about saying no. It is about preserving system health so legitimate work can continue.
