Retry With Exponential Backoff And Jitter

Retry transient failures with increasing delays and randomness so recovery does not create a synchronized traffic spike.

Concepts Covered

  • Transient failures
  • Retry storms
  • Exponential backoff
  • Jitter
  • Retry budgets
  • Idempotency
  • Retryable errors
  • User latency budgets

1. Intent

Retry with exponential backoff and jitter makes retries safer.

Instead of retrying immediately in a tight loop, clients wait longer after each failure and add randomness so many clients do not retry at the same instant.

The intent is not "retry everything." The intent is to recover from temporary failure without turning recovery into more traffic than the system can handle.

2. The Problem Without This Pattern

Retries can make outages worse.

Imagine a dependency slows down. Every caller times out and retries immediately:

normal traffic: 10,000 requests/sec
timeout happens
clients retry immediately
traffic becomes 20,000 or 30,000 requests/sec
dependency gets even slower
more retries happen

This is a retry storm. The original failure may be temporary, but synchronized retries amplify it into sustained overload.
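
The arithmetic behind that amplification is simple enough to sketch. The function below is an illustrative model, not a measurement: it assumes every failed attempt is retried immediately and that each attempt fails independently with the same probability. The name offered_load is made up for this example.

```python
def offered_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Estimate the requests/sec a dependency sees when failures are retried immediately.

    Assumption: each attempt fails independently with probability `failure_rate`,
    so attempt k only happens if the previous k attempts all failed.
    """
    # Expected attempts per original request: 1 + p + p^2 + ... + p^max_retries
    expected_attempts = sum(failure_rate ** k for k in range(max_retries + 1))
    return base_rps * expected_attempts


# During a full outage every attempt fails, so each retry adds a full copy of traffic.
print(offered_load(10_000, failure_rate=1.0, max_retries=1))  # 20000.0
print(offered_load(10_000, failure_rate=1.0, max_retries=2))  # 30000.0
```

With one or two immediate retries, the 10,000 requests/sec from the example become 20,000 or 30,000 at exactly the moment the dependency is least able to serve them.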

Retries are especially dangerous when the operation is not idempotent. Retrying a message send, payment charge, like event, or counter increment without deduplication can create duplicate business effects.
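
One common way to make such an operation safe to retry is an idempotency key. The sketch below is a minimal in-memory illustration under stated assumptions: ChargeService, its charge method, and the key format are all hypothetical, and a real service would need durable storage and an atomic "insert if absent" step.

```python
import uuid


class ChargeService:
    """Hypothetical server-side handler that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._results = {}  # in-memory only; a real service needs durable storage
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key means this is a retry: return the stored result, do not charge again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = ChargeService()
key = str(uuid.uuid4())  # generated once per logical operation, reused on every retry

first = service.charge(key, 1_999)
retry = service.charge(key, 1_999)  # e.g. the first response was lost in transit
```

Here the retry returns the same result as the first call and the customer is charged once, because the client reuses the key instead of generating a new one per attempt.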

3. How The Pattern Works

A basic retry plan:

attempt 1 fails -> wait 100ms + jitter
attempt 2 fails -> wait 200ms + jitter
attempt 3 fails -> wait 400ms + jitter
attempt 4 fails -> wait 800ms + jitter
stop after max attempts or retry budget

Exponential backoff multiplies the delay after each failure (doubling in the plan above). Jitter adds randomness so that clients which failed at the same moment do not retry at the same moment.
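
The plan above can be sketched as a small delay function. This is one possible reading of "wait + jitter", with the jitter added on top of the exponential delay. The parameter names, the 10 second cap, and the 50% jitter ratio are assumptions chosen for illustration, not recommendations.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.1, max_delay: float = 10.0,
                  jitter_ratio: float = 0.5,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based).

    base, max_delay and jitter_ratio are illustrative defaults.
    """
    rng = rng if rng is not None else random.Random()
    # Exponential part: 0.1s, 0.2s, 0.4s, 0.8s, ... capped at max_delay.
    exponential = min(max_delay, base * 2 ** (attempt - 1))
    # Additive jitter: a random extra wait of up to jitter_ratio * exponential.
    return exponential + rng.uniform(0, jitter_ratio * exponential)


for attempt in range(1, 5):
    print(f"attempt {attempt} fails -> wait {backoff_delay(attempt) * 1000:.0f}ms")
```

A common alternative is "full jitter", which draws the whole delay uniformly between zero and the exponential value. It spreads clients more widely, at the cost of occasionally producing very short waits.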

The system should also define:

  • which errors are retryable
  • max attempts
  • max total retry time
  • per-request timeout
  • retry budget
  • whether the operation is idempotent

Retries should have an end. Infinite retries without backpressure can create permanent background load.
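
A bounded retry loop that covers several of these settings might look like the sketch below. TransientError and call_with_retries are hypothetical names, the defaults are illustrative, and the sketch assumes the operation is idempotent and enforces its own per-request timeout. A retry budget is not modelled here.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe and useful to retry."""


def call_with_retries(operation, max_attempts=4, base=0.1, max_delay=2.0,
                      max_total=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call `operation` until it succeeds, attempts run out, or time runs out.

    Assumes `operation` is idempotent and enforces its own per-request timeout.
    Any exception other than TransientError propagates immediately.
    """
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the final failure
            exponential = min(max_delay, base * 2 ** (attempt - 1))
            delay = exponential + random.uniform(0, exponential / 2)
            if clock() - start + delay > max_total:
                raise  # the next wait would exceed the total retry time
            sleep(delay)
```

Passing sleep and clock as parameters is a design choice for testability: a test can record the waits or fake the passage of time instead of actually sleeping.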

4. When To Use It

Use this pattern for:

  • transient network failures
  • temporary dependency errors
  • queue publish retries
  • idempotent API calls
  • background jobs
  • push notification provider retries
  • outbox publisher retries

It is especially useful when the caller can safely try again later and the downstream dependency needs time to recover.

5. When Not To Use It

Do not blindly retry:

  • non-idempotent operations
  • validation errors
  • authentication failures
  • authorization failures
  • permanent business-rule failures
  • calls where the user latency budget is already exhausted
  • dependencies protected by an open circuit breaker

If a request fails because the payload is invalid, retrying only wastes capacity. If a request fails because the downstream service is overloaded, aggressive retries may make recovery harder.
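
The list above can be turned into an explicit decision function. The sketch below assumes an HTTP-style API, which the pattern itself does not require. The set of retryable status codes is a conservative illustration, and should_retry and its parameters are hypothetical names.

```python
# Assumption: an HTTP-style API. These codes usually signal a temporary condition.
RETRYABLE_STATUS = frozenset({408, 429, 502, 503, 504})


def should_retry(status: int, idempotent: bool, circuit_open: bool,
                 remaining_budget_s: float, next_delay_s: float) -> bool:
    """Decide whether one more attempt is worthwhile (illustrative policy only)."""
    if circuit_open:
        return False  # the breaker already decided the dependency needs rest
    if not idempotent:
        return False  # a retry could duplicate the side effect
    if status not in RETRYABLE_STATUS:
        return False  # 400, 401, 403, 422 and similar will fail the same way again
    # Only retry if the wait still fits inside the user's latency budget.
    return next_delay_s < remaining_budget_s


print(should_retry(503, idempotent=True, circuit_open=False,
                   remaining_budget_s=1.0, next_delay_s=0.2))  # True
print(should_retry(422, idempotent=True, circuit_open=False,
                   remaining_budget_s=1.0, next_delay_s=0.2))  # False
```

Keeping the decision in one place makes the "do not retry" cases reviewable, instead of leaving them implicit in scattered exception handlers.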

6. Data And Operational Model

Good retry systems define:

  • retryable error classes
  • base delay
  • maximum delay
  • jitter strategy
  • max attempts
  • total retry deadline
  • retry budget per caller or workload
  • idempotency requirement
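
Of these settings, the retry budget is the least self-explanatory. One way to express it is a token bucket in which normal traffic earns the right to retry. The sketch below is a single-process illustration: RetryBudget and its method names are hypothetical, and the ratio and cap are placeholder values.

```python
class RetryBudget:
    """Token bucket that keeps retries to a fraction of first attempts.

    Hypothetical helper: every first attempt deposits `ratio` tokens and every
    retry spends one, so retries stay near `ratio` of normal traffic.
    """

    def __init__(self, ratio: float = 0.1, max_tokens: float = 10.0) -> None:
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = 0.0

    def record_first_attempt(self) -> None:
        # Cap the balance so a long quiet period cannot fund a large retry burst.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self) -> bool:
        """Return True and spend a token if a retry is allowed right now."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # budget exhausted: fail fast instead of retrying


budget = RetryBudget(ratio=0.5)
for _ in range(4):
    budget.record_first_attempt()   # four normal requests earn two retries
print(budget.try_spend(), budget.try_spend(), budget.try_spend())  # True True False
```

When the dependency is healthy the budget is rarely touched. During an outage it runs dry quickly, which caps retry traffic regardless of how many callers are failing.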

Operators should monitor:

  • retry rate by dependency
  • final failure rate after retries
  • retry success rate
  • latency added by retries
  • retry budget exhaustion
  • downstream saturation during retry waves

Retries should be visible. Hidden retries make latency unpredictable and make incidents harder to understand.
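
A minimal way to make retries visible is to count them per dependency at the point where the retry loop finishes. The sketch below keeps the counters in memory for illustration. RetryMetrics and its method names are hypothetical, and a real service would export the same numbers through whatever metrics library it already uses.

```python
from collections import Counter, defaultdict


class RetryMetrics:
    """In-memory counters for retry behaviour, keyed by dependency (illustrative)."""

    def __init__(self) -> None:
        self.counts = Counter()
        self.added_latency_s = defaultdict(float)

    def record_call(self, dependency: str, attempts: int, succeeded: bool,
                    retry_latency_s: float = 0.0) -> None:
        """Record one finished call: how many attempts it took and how it ended."""
        self.counts[dependency, "calls"] += 1
        self.counts[dependency, "retries"] += attempts - 1
        if attempts > 1 and succeeded:
            self.counts[dependency, "retry_successes"] += 1
        if not succeeded:
            self.counts[dependency, "final_failures"] += 1
        self.added_latency_s[dependency] += retry_latency_s

    def retry_rate(self, dependency: str) -> float:
        """Average number of retries per call for one dependency."""
        calls = self.counts[dependency, "calls"]
        return self.counts[dependency, "retries"] / calls if calls else 0.0


metrics = RetryMetrics()
metrics.record_call("payments", attempts=1, succeeded=True)
metrics.record_call("payments", attempts=3, succeeded=True, retry_latency_s=0.7)
metrics.record_call("payments", attempts=4, succeeded=False, retry_latency_s=1.5)
```

A rising retry rate with a flat final failure rate suggests retries are absorbing a transient problem. Both rising together suggests the retries are no longer helping.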

7. Failure Modes

  • Retrying non-idempotent operations duplicates side effects.
  • Too many retries overload dependencies.
  • No jitter creates synchronized retry waves.
  • Retrying permanent errors wastes capacity.
  • Hidden retries make latency unpredictable.
  • Total retry time exceeds the user's patience, so a retry succeeds after the user has already given up.
  • Every service in a call chain retries, multiplying traffic at each layer.
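
The last failure mode is easy to underestimate, so a worked number helps. The function below is a worst-case bound under a simple assumption: every layer makes the same number of attempts, and every attempt triggers a fresh call to the layer below. The name worst_case_calls is made up for this example.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Upper bound on calls reaching the deepest dependency for one user request.

    Assumes every layer makes up to `attempts_per_layer` attempts and each attempt
    triggers a fresh call to the layer below.
    """
    return attempts_per_layer ** layers


# Three services in a chain, each allowing 3 attempts (1 try + 2 retries):
print(worst_case_calls(attempts_per_layer=3, layers=3))  # 27
```

One user request can therefore become 27 calls at the bottom of the chain. One common response is to retry at a single layer and let the others fail fast.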

8. Tradeoffs

Benefit                              Cost
Handles transient failures           Adds latency
Reduces synchronized retry spikes    More client logic
Improves resilience                  Can amplify load if misused
Works well in background jobs        Requires retry budgets
Gives dependencies recovery time     May delay final failure

Retries are useful when failure is temporary. They are dangerous when they hide overload or duplicate side effects.
