Retry With Exponential Backoff And Jitter
Retry transient failures with increasing delays and randomness so recovery does not create a synchronized traffic spike.
Concepts Covered
- Transient failures
- Retry storms
- Exponential backoff
- Jitter
- Retry budgets
- Idempotency
- Retryable errors
- User latency budgets
1. Intent
Retry with exponential backoff and jitter makes retries safer for the dependency being called as well as for the caller.
Instead of retrying immediately in a tight loop, clients wait longer after each failure and add randomness so many clients do not retry at the same instant.
The intent is not "retry everything." The intent is to recover from temporary failure without turning recovery into more traffic than the system can handle.
2. The Problem Without This Pattern
Retries can make outages worse.
Imagine a dependency slows down. Every caller times out and retries immediately:
normal traffic: 10,000 requests/sec
timeout happens
clients retry immediately
traffic becomes 20,000 or 30,000 requests/sec
dependency gets even slower
more retries happen
This is a retry storm. The original failure may be temporary, but synchronized retries amplify it into sustained overload.
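The amplification above can be estimated with a small back-of-the-envelope model. This is a sketch under simplifying assumptions that are not from the original text: every failed attempt is retried immediately, and each attempt fails independently with the same probability. The function name `offered_load` is made up for illustration.

```python
def offered_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Requests/sec reaching the dependency when failures are retried immediately.

    Attempt k only happens if the previous k attempts all failed, so the
    expected number of attempts per request is 1 + f + f^2 + ... + f^max_retries.
    """
    attempts_per_request = sum(failure_rate ** k for k in range(max_retries + 1))
    return base_rps * attempts_per_request


# Everything times out (failure rate 1.0): one retry doubles traffic, two triple it.
print(offered_load(10_000, 1.0, 1))  # 20000.0
print(offered_load(10_000, 1.0, 2))  # 30000.0
# Half of all attempts fail, up to two retries each.
print(offered_load(10_000, 0.5, 2))  # 17500.0
```

The worst case is the one that matters: during a real slowdown the failure rate approaches 1.0, which is exactly when the dependency can least afford the extra traffic.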
Retries are especially dangerous when the operation is not idempotent. Retrying a message send, payment charge, like event, or counter increment without deduplication can create duplicate business effects.
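One common way to make such an operation safe to retry is an idempotency key: the caller generates a key once per logical operation and sends the same key on every retry, and the server applies the effect at most once per key. The sketch below is a toy in-memory illustration. `PaymentService`, `charge`, and the receipt format are hypothetical names, and a real service would store keys durably and handle concurrent requests.

```python
import uuid


class PaymentService:
    """Toy service that deduplicates charges by idempotency key (in memory only)."""

    def __init__(self):
        self._results = {}        # idempotency key -> stored receipt
        self.charges_applied = 0  # how many times money actually moved

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A retry with a known key returns the stored result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_applied += 1
        receipt = f"receipt-{self.charges_applied}-{amount_cents}"
        self._results[idempotency_key] = receipt
        return receipt


service = PaymentService()
key = str(uuid.uuid4())             # generated once per logical operation
first = service.charge(key, 500)
second = service.charge(key, 500)   # simulated retry after a lost response
print(first == second, service.charges_applied)  # True 1
```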
3. How The Pattern Works
A basic retry plan:
attempt 1 fails -> wait 100ms + jitter
attempt 2 fails -> wait 200ms + jitter
attempt 3 fails -> wait 400ms + jitter
attempt 4 fails -> wait 800ms + jitter
stop after max attempts or when the retry budget is exhausted
Exponential backoff increases the delay after each failure. Jitter adds randomness to spread retry traffic over time.
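The schedule above can be sketched as a delay function. The 100 ms base matches the plan; the 5 second cap, the 50% additive jitter range, and the function names are assumptions chosen for illustration. A second variant, often called "full jitter", draws the whole delay at random from zero up to the exponential value, which spreads retries more widely at the cost of some very short waits.

```python
import random

BASE_DELAY_S = 0.1  # 100 ms, as in the plan above
MAX_DELAY_S = 5.0   # assumed cap so delays do not grow without bound


def exponential_delay(attempt: int) -> float:
    """Delay without jitter after the given failed attempt (1-based)."""
    return min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (attempt - 1))


def delay_with_jitter(attempt: int, rng: random.Random) -> float:
    """Exponential delay plus a random extra wait of up to 50% of that delay."""
    base = exponential_delay(attempt)
    return base + rng.uniform(0, base * 0.5)


def full_jitter_delay(attempt: int, rng: random.Random) -> float:
    """Alternative: pick the entire delay uniformly between 0 and the exponential value."""
    return rng.uniform(0, exponential_delay(attempt))


rng = random.Random()
for attempt in range(1, 5):
    print(attempt, exponential_delay(attempt), round(delay_with_jitter(attempt, rng), 3))
```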
The system should also define:
- which errors are retryable
- max attempts
- max total retry time
- per-request timeout
- retry budget
- whether the operation is idempotent
Retries should have an end. Infinite retries without backpressure can create permanent background load.
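A minimal retry loop that enforces both a maximum attempt count and a total retry deadline might look like the following. It is a sketch, not a definitive implementation: `TransientError` and `call_with_retries` are hypothetical names, the default values are assumptions, and the delay uses the "full jitter" variant. The `sleep`, `clock`, and `rng` parameters exist only so the loop can be exercised without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, *, max_attempts=4, base_delay_s=0.1,
                      max_delay_s=2.0, deadline_s=5.0,
                      sleep=time.sleep, clock=time.monotonic, rng=random):
    """Call operation(); retry TransientError with capped, jittered backoff."""
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            delay = rng.uniform(0, ceiling)
            # Give up early if waiting would exceed the total retry deadline.
            if clock() - start + delay > deadline_s:
                raise
            sleep(delay)
```

Any exception other than `TransientError` propagates on the first attempt, so permanent errors are never retried by this loop.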
4. When To Use It
Use this pattern for:
- transient network failures
- temporary dependency errors
- queue publish retries
- idempotent API calls
- background jobs
- push notification provider retries
- outbox publisher retries
It is especially useful when the caller can safely try again later and the downstream dependency needs time to recover.
5. When Not To Use It
Do not blindly retry:
- non-idempotent operations
- validation errors
- authentication failures
- authorization failures
- permanent business-rule failures
- calls where the user latency budget is already exhausted
- dependencies protected by an open circuit breaker
If a request fails because the payload is invalid, retrying only wastes capacity. If a request fails because the downstream service is overloaded, aggressive retries may make recovery harder.
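The checks above can be combined into a single retry decision. The sketch below assumes an HTTP-style dependency. The set of retryable status codes is a common convention rather than a rule from the original text, and `should_retry` and its parameters are hypothetical.

```python
# Assumed convention: timeouts, throttling, and gateway/unavailable errors are transient.
RETRYABLE_STATUS = frozenset({408, 429, 502, 503, 504})


def should_retry(status: int, *, idempotent: bool,
                 remaining_budget_s: float, circuit_open: bool = False) -> bool:
    """Decide whether a failed call is worth another attempt."""
    if circuit_open:
        return False  # the breaker has already decided the dependency needs rest
    if remaining_budget_s <= 0:
        return False  # the user latency budget is exhausted
    if not idempotent:
        return False  # a retry could duplicate side effects
    # Validation (400, 422), authentication (401), and authorization (403)
    # failures are not in the set, so they are never retried.
    return status in RETRYABLE_STATUS


print(should_retry(503, idempotent=True, remaining_budget_s=1.5))   # True
print(should_retry(400, idempotent=True, remaining_budget_s=1.5))   # False
print(should_retry(503, idempotent=False, remaining_budget_s=1.5))  # False
```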
6. Data And Operational Model
Good retry systems define:
- retryable error classes
- base delay
- maximum delay
- jitter strategy
- max attempts
- total retry deadline
- retry budget per caller or workload
- idempotency requirement
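A retry budget can be sketched as a token bucket in which ordinary requests earn retry capacity and each retry spends it, so retries stay a bounded fraction of normal traffic. The class name, the 10% default, and the cap on banked retries are assumptions for illustration. The bucket counts in integer units to avoid floating-point drift.

```python
RETRY_COST_UNITS = 100  # one retry costs 100 units


class RetryBudget:
    """Allow retries only up to a percentage of regular request volume."""

    def __init__(self, retry_percent: int = 10, max_banked_retries: int = 10):
        self.deposit_units = retry_percent  # units earned per regular request
        self.cap_units = max_banked_retries * RETRY_COST_UNITS
        self.units = self.cap_units         # start with a full bucket

    def record_request(self) -> None:
        """Call once per first attempt; earns a fraction of one retry."""
        self.units = min(self.cap_units, self.units + self.deposit_units)

    def try_spend_retry(self) -> bool:
        """Return True and spend budget if a retry is allowed right now."""
        if self.units >= RETRY_COST_UNITS:
            self.units -= RETRY_COST_UNITS
            return True
        return False


budget = RetryBudget(retry_percent=10, max_banked_retries=2)
print(budget.try_spend_retry(), budget.try_spend_retry(), budget.try_spend_retry())
# True True False
```

When `try_spend_retry` returns False, the caller fails fast instead of retrying, which is what keeps a widespread outage from turning into a retry storm.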
Operators should monitor:
- retry rate by dependency
- final failure rate after retries
- retry success rate
- latency added by retries
- retry budget exhaustion
- downstream saturation during retry waves
Retries should be visible. Hidden retries make latency unpredictable and make incidents harder to understand.
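One way to make retries visible is to record the outcome of every logical request, including how many attempts it took. The sketch below keeps counters in memory. `RetryMetrics` and its field names are hypothetical, and a real system would export these values to its metrics backend instead.

```python
from collections import Counter


class RetryMetrics:
    """In-memory counters for retry behavior, keyed by dependency name."""

    def __init__(self):
        self.counts = Counter()
        self.retry_wait_s = Counter()

    def record(self, dependency, *, attempts, succeeded, retry_wait_s=0.0):
        """Record one logical request after its final attempt."""
        self.counts[dependency, "requests"] += 1
        self.counts[dependency, "retries"] += attempts - 1
        if attempts > 1:
            self.counts[dependency, "retried"] += 1
            if succeeded:
                self.counts[dependency, "recovered"] += 1
        if not succeeded:
            self.counts[dependency, "final_failures"] += 1
        self.retry_wait_s[dependency] += retry_wait_s

    def summary(self, dependency):
        requests = self.counts[dependency, "requests"]
        retried = self.counts[dependency, "retried"]
        return {
            # Extra attempts per request, so it can exceed 1.0 during an incident.
            "retry_rate": self.counts[dependency, "retries"] / requests if requests else 0.0,
            # Share of retried requests that eventually succeeded.
            "retry_success_rate": self.counts[dependency, "recovered"] / retried if retried else 0.0,
            "final_failure_rate": self.counts[dependency, "final_failures"] / requests if requests else 0.0,
            "retry_wait_s": self.retry_wait_s[dependency],
        }
```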
7. Failure Modes
- Retrying non-idempotent operations duplicates side effects.
- Too many retries overload dependencies.
- No jitter creates synchronized retry waves.
- Retrying permanent errors wastes capacity.
- Hidden retries make latency unpredictable.
- Total retry time exceeds the user's latency budget, so the request finally succeeds after the user has given up.
- Every service in a call chain retries, multiplying traffic at each layer.
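The multiplication in the last failure mode is easy to underestimate. As a rough worst-case sketch, assume every layer in the chain makes the same number of attempts and the bottom dependency fails every time; the function name is made up for illustration.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Calls reaching the bottom dependency for one user request, worst case.

    Each layer repeats everything beneath it, so the counts multiply
    instead of adding.
    """
    return attempts_per_layer ** layers


print(worst_case_calls(3, 1))  # 3
print(worst_case_calls(3, 3))  # 27
print(worst_case_calls(4, 4))  # 256
```

This is one reason to retry at a single layer of the chain, or to share a retry budget across layers, instead of letting every service retry independently.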
8. Tradeoffs
| Benefit | Cost |
|---|---|
| Handles transient failures | Adds latency |
| Reduces synchronized retry spikes | More client logic |
| Improves resilience | Can amplify load if misused |
| Works well in background jobs | Requires retry budgets |
| Gives dependencies recovery time | May delay final failure |
Retries are useful when failure is temporary. They are dangerous when they hide overload or duplicate side effects.