Rate Limiting

Control how frequently clients can perform actions so systems remain available, fair, and protected from accidental or abusive overload.


Concepts Covered

  • Quotas
  • Token bucket
  • Leaky bucket
  • Abuse prevention
  • Fairness
  • Load shedding
  • Distributed rate limits
  • Retry behavior

Definition

Rate limiting restricts how often a caller can perform an action.

The caller might be:

  • a user
  • an IP address
  • an API key
  • an organization
  • a device
  • another internal service

The action might be creating a short link, sending a message, attempting login, liking a post, calling an API, or uploading media.

Rate limiting exists to keep systems available when some callers send more traffic than the system should accept.

The Pain That Forces Rate Limiting

Without rate limits, one caller can consume shared capacity.

Example:

1. One client sends 20,000 create-link requests per minute.
2. API workers spend time on that client.
3. Database writes increase.
4. Abuse checks and analytics queues fill.
5. Normal users experience slow requests.

The system may be healthy for normal usage, but unhealthy under unfair usage.

Rate limiting gives the system a boundary:

This caller can perform this action this many times in this period.

It protects reliability, cost, abuse surfaces, and fairness.

Mental Model

Rate limiting is controlled refusal.

It is not a failure of the system. It is the system intentionally saying:

Accepting this request would harm reliability or fairness.

Good rate limiting is specific. It limits the smallest useful identity and action. A public API might limit by API key and endpoint. A login system might limit by account and IP. A messaging system might limit sends per user, per conversation, and per device.

Bad rate limiting is too blunt. It blocks legitimate users, hides real capacity problems, or creates confusing retry behavior.

Common Algorithms

Algorithm        Intuition                                           Good for
Fixed window     Count requests in a fixed time window               Simple quotas
Sliding window   Smooth counts across window boundaries              Fairer API limits
Token bucket     Tokens refill over time and requests spend tokens   Allowing controlled bursts
Leaky bucket     Requests drain at a steady rate                     Smoothing traffic

Token bucket is common because it supports bursts without allowing unlimited sustained traffic.

Example:

bucket capacity: 100 tokens
refill rate: 10 tokens per second
request cost: 1 token

If the caller has tokens, the request proceeds. If the bucket is empty, the request is rejected or delayed.
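Those numbers can be turned into a minimal sketch. The class name and the injectable clock below are illustrative choices, not part of any particular library:

```python
import time

class TokenBucket:
    """Token bucket limiter: capacity caps burst size, refill_rate caps sustained rate."""

    def __init__(self, capacity, refill_rate, now=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = float(capacity)    # start full so an initial burst is allowed
        self.now = now                   # injectable clock, handy for testing
        self.last = now()

    def allow(self, cost=1):
        t = self.now()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Deterministic demo using a manual clock.
clock = [0.0]
bucket = TokenBucket(capacity=100, refill_rate=10, now=lambda: clock[0])
burst = [bucket.allow() for _ in range(150)]  # burst of 150 at t=0
# Exactly the first 100 pass; the bucket is then empty.
clock[0] = 5.0  # five seconds later, 50 tokens have refilled
```

This shows the defining tradeoff: the caller can spend up to 100 requests instantly, but sustained throughput is bounded by the 10/second refill.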

What Happens When Limited

When a request exceeds the limit, the system can:

  • reject it with 429 Too Many Requests
  • return a Retry-After header
  • delay it in a queue
  • degrade the response
  • require stronger verification
  • silently drop low-value background work

For user-facing APIs, explicit responses are usually better. Clients need to know whether they should retry, wait, or stop.
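One way a client can make that decision is to honor Retry-After when the server sends it, and otherwise back off with jitter. A sketch under those assumptions (the function name and defaults are illustrative):

```python
import random

def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds a client should wait after a 429 before retrying.

    Honors the server's Retry-After value when present; otherwise uses
    exponential backoff with full jitter so limited clients do not all
    retry at the same instant and recreate the overload.
    """
    if retry_after is not None:
        return float(retry_after)
    # Full jitter: a random delay in [0, min(cap, base * 2**attempt)].
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Jitter matters here: without it, every client limited at the same moment retries at the same moment, turning one spike into a repeating one.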

For internal systems, rate limits can act like backpressure. A downstream service that is overloaded can force callers to slow down instead of letting retries create a collapse.

Distributed Rate Limits

Rate limiting is simple on one server and harder across many servers.

If every API server keeps its own counter, a caller can exceed the intended global limit by spreading requests across servers.

server A allows 100/min
server B allows 100/min
server C allows 100/min
actual global usage: 300/min

To enforce global limits, systems often use shared stores such as Redis, local approximations with periodic sync, regional limiters, or dedicated rate limit services.

The tradeoff is latency and availability. A central limiter can be accurate, but it can also become a dependency on the critical path.
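The shared-store approach can be made concrete with a fixed-window counter keyed by caller and window. Here a plain dict stands in for Redis, where the increment would be an atomic INCR with an expiry; the class name is illustrative:

```python
import time

class SharedWindowLimiter:
    """Fixed-window global limiter over a shared store.

    `store` is a dict here for illustration; in production it would be a
    shared store such as Redis, with the increment done atomically.
    """

    def __init__(self, store, limit, window_seconds):
        self.store = store
        self.limit = limit
        self.window = window_seconds

    def allow(self, caller, now=None):
        now = time.time() if now is None else now
        key = (caller, int(now // self.window))  # one counter per caller per window
        count = self.store.get(key, 0) + 1       # atomic INCR in a real shared store
        self.store[key] = count
        return count <= self.limit

store = {}
limiter = SharedWindowLimiter(store, limit=100, window_seconds=60)
# Every API server sharing `store` sees the same counter, so 300 requests
# spread across three servers still collapse to one 100/min global limit.
```

Because all servers read and write the same counter, the per-server loophole from the example above disappears; the price is a network round trip to the shared store on every request.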

Design Questions

Important questions:

  • What identity is being limited?
  • What action is being limited?
  • Should limits allow bursts?
  • Is the limit global, regional, per tenant, or per server?
  • What response does the client receive?
  • Are limits different for trusted users, paid users, or internal jobs?
  • What happens if the rate limiter is unavailable?
  • How do clients avoid retry storms after being limited?

Rate limits are product decisions as much as infrastructure decisions.

Operational Reality

Watch:

  • requests allowed versus denied
  • top limited identities
  • limiter latency
  • limiter error rate
  • false positives from legitimate users
  • retry traffic after 429
  • abuse patterns by endpoint
  • downstream saturation avoided by limits

Rate limiting is not just about saying no. It is about preserving system health so legitimate work can continue.
