Dead-Letter Queue

Isolate messages that repeatedly fail processing so they do not block healthy event processing forever.

Concepts Covered

  • Poison messages
  • Retry limits
  • Failed-event isolation
  • Manual replay
  • Operational triage
  • Consumer progress
  • Error metadata
  • Replay safety

1. Intent

A Dead-Letter Queue, often shortened to DLQ, stores messages that could not be processed after repeated attempts.

The goal is to keep one bad message from blocking the rest of the stream or queue.

The DLQ does not make the failed message disappear. It moves the message into an explicit operational lane where it can be inspected, fixed, replayed, or intentionally discarded.

2. The Problem Without This Pattern

If a consumer fails on the same message every time, it keeps retrying that message and never makes progress.

This is common when:

  • payload schema is invalid
  • required data is missing
  • downstream dependency rejects the event
  • consumer code has a bug for one edge case
  • the event refers to deleted or unavailable state

Without a DLQ, one poison message can stall an entire partition or consumer group.

Example:

consumer reads event 842
event 842 always crashes parser
consumer retries event 842
events 843, 844, 845 are never processed

The failure is now larger than one bad event. It blocks healthy work behind it.
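
The stall described above can be reproduced with a small simulation. This is only a sketch: `parse` and `consume_without_dlq` are hypothetical names, event 842 stands in for the poison message, and `max_loops` exists only so the demo terminates.

```python
def parse(event_id: int) -> str:
    """Hypothetical handler: event 842 stands in for a poison message."""
    if event_id == 842:
        raise ValueError("unparseable payload")
    return f"processed {event_id}"


def consume_without_dlq(events: list[int], max_loops: int = 5) -> list[int]:
    """Retry the head of the queue until it succeeds; never skip it.

    max_loops only bounds the demo; a real consumer in this situation
    would keep retrying the same event indefinitely.
    """
    processed: list[int] = []
    position = 0
    for _ in range(max_loops):
        if position >= len(events):
            break
        try:
            parse(events[position])
        except ValueError:
            continue  # same position again: no progress
        processed.append(events[position])
        position += 1
    return processed


stuck = consume_without_dlq([841, 842, 843, 844, 845])
print(stuck)  # [841]
```

Events 843 to 845 are valid, yet they are never reached because the consumer cannot move past 842.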

3. How The Pattern Works

Basic flow:

consume message
  -> process
  -> if success: acknowledge
  -> if failure: retry with backoff
  -> if retries exhausted: move to DLQ
  -> continue processing later messages
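
The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a broker-specific implementation: the in-memory lists stand in for a real queue and DLQ, and `consume`, `flaky_handler`, `max_attempts`, and `base_delay` are hypothetical names.

```python
import time
from typing import Callable


def consume(
    messages: list[dict],
    handler: Callable[[dict], None],
    dead_letters: list[dict],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> list[dict]:
    """Process messages in order; dead-letter any that exhaust their retries."""
    acknowledged: list[dict] = []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
            except Exception as exc:
                if attempt == max_attempts:
                    # Retries exhausted: move to the DLQ and fall through
                    # to the next message.
                    dead_letters.append({
                        "original_message": message,
                        "error_reason": repr(exc),
                        "attempt_count": attempt,
                    })
                else:
                    # Exponential backoff before the next attempt.
                    time.sleep(base_delay * 2 ** (attempt - 1))
            else:
                acknowledged.append(message)  # success: acknowledge
                break
    return acknowledged


def flaky_handler(message: dict) -> None:
    """Hypothetical handler that always fails on message 842."""
    if message["id"] == 842:
        raise ValueError("unparseable payload")


dlq: list[dict] = []
done = consume([{"id": i} for i in range(841, 846)], flaky_handler, dlq)
print([m["id"] for m in done])   # [841, 843, 844, 845]
print(dlq[0]["attempt_count"])   # 3
```

Message 842 ends up in `dlq` after three attempts, and the messages behind it are processed normally.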

The DLQ preserves the failed message plus metadata:

original_message
error_reason
attempt_count
first_failed_at
last_failed_at
consumer_name
trace_id
schema_version

This metadata matters. Operators need to know whether the failure came from bad data, a dependency outage, a code bug, or a schema mismatch.
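
One way to carry that metadata is a small record type built at the moment a message is dead-lettered. The sketch below assumes, purely for illustration, that `trace_id` and `schema_version` travel inside the message envelope; `DeadLetter` and `to_dead_letter` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DeadLetter:
    original_message: dict
    error_reason: str
    attempt_count: int
    first_failed_at: datetime
    last_failed_at: datetime
    consumer_name: str
    trace_id: str
    schema_version: str


def to_dead_letter(
    message: dict,
    exc: Exception,
    attempt_count: int,
    first_failed_at: datetime,
    consumer_name: str,
) -> DeadLetter:
    """Wrap a failed message with the context an operator needs for triage."""
    return DeadLetter(
        original_message=message,
        error_reason=f"{type(exc).__name__}: {exc}",
        attempt_count=attempt_count,
        first_failed_at=first_failed_at,
        last_failed_at=datetime.now(timezone.utc),
        consumer_name=consumer_name,
        # Assumption: the envelope carries these fields; fall back if absent.
        trace_id=message.get("trace_id", "unknown"),
        schema_version=message.get("schema_version", "unknown"),
    )


started = datetime.now(timezone.utc)
record = to_dead_letter(
    {"id": 842, "trace_id": "t-1", "schema_version": "2"},
    KeyError("user_id"),
    attempt_count=3,
    first_failed_at=started,
    consumer_name="projection-builder",
)
print(record.error_reason)  # KeyError: 'user_id'
```

Recording the exception type alongside the message is what lets an operator separate, for example, a missing field from a dependency timeout without rerunning the message.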

4. When To Use It

Use a DLQ when:

  • messages are processed asynchronously
  • some messages may be invalid or unprocessable
  • retries are useful but should not continue forever
  • operators need to inspect and replay failures
  • consumer progress matters
  • poison messages can block ordered partitions

Good examples:

  • analytics event processing
  • notification workers
  • outbox publishers
  • message delivery workers
  • projection builders

5. When Not To Use It

A DLQ is not a substitute for fixing consumer bugs.

Avoid treating the DLQ as a trash can. Every DLQ should have ownership, alerting, and a replay policy.

Be careful when:

  • failed messages contain sensitive data
  • replay can duplicate side effects
  • the system cannot safely reprocess old events
  • business operations require immediate human intervention

If nobody watches the DLQ, it becomes a silent data loss mechanism with a nicer name.
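
For the sensitive-data concern in the list above, one possible control is to mask known fields before a payload is written to the DLQ. The sketch below is an assumption-heavy illustration: the field names in `SENSITIVE_FIELDS` are hypothetical, and it only recurses into nested dictionaries, not lists.

```python
SENSITIVE_FIELDS = frozenset({"email", "card_number", "ssn"})  # hypothetical names


def redact(payload: dict, sensitive: frozenset = SENSITIVE_FIELDS) -> dict:
    """Return a copy of payload with sensitive values masked.

    Recurses into nested dicts only; the input is left unmodified.
    """
    clean: dict = {}
    for key, value in payload.items():
        if key in sensitive:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value, sensitive)
        else:
            clean[key] = value
    return clean


event = {"id": 842, "user": {"email": "a@example.com", "plan": "pro"}}
print(redact(event))
# {'id': 842, 'user': {'email': '[REDACTED]', 'plan': 'pro'}}
```

Masking has a cost: a redacted payload usually cannot be replayed as-is, so the original would need to live in an access-controlled store or be re-fetched at replay time.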

6. Data And Operational Model

DLQ records should include the metadata from section 3 plus the source position and replay state:

original_message
error_reason
attempt_count
first_failed_at
last_failed_at
consumer_name
trace_id
schema_version
partition
offset
replay_status

Operators should monitor:

  • DLQ depth
  • DLQ growth rate
  • top error reasons
  • replay success rate
  • age of oldest DLQ message
  • sensitive payload exposure risk
  • repeated failures after replay
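
Several of these signals can be derived directly from the stored records. The following sketch computes depth, age of the oldest message, and top error reasons from an in-memory list; `dlq_metrics` and the record layout are assumptions for illustration, and growth rate is omitted because it needs two depth samples taken over time.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def dlq_metrics(records: list[dict], now: datetime) -> dict:
    """Summarise DLQ health from records holding error_reason and first_failed_at."""
    oldest_age = max(
        (now - r["first_failed_at"] for r in records),
        default=timedelta(0),  # empty DLQ: nothing is waiting
    )
    top_errors = Counter(r["error_reason"] for r in records).most_common(3)
    return {
        "depth": len(records),
        "oldest_age": oldest_age,
        "top_error_reasons": top_errors,
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"error_reason": "schema_mismatch", "first_failed_at": now - timedelta(hours=5)},
    {"error_reason": "schema_mismatch", "first_failed_at": now - timedelta(hours=1)},
    {"error_reason": "dependency_timeout", "first_failed_at": now - timedelta(minutes=10)},
]
metrics = dlq_metrics(records, now)
print(metrics["depth"])                 # 3
print(metrics["oldest_age"])            # 5:00:00
print(metrics["top_error_reasons"][0])  # ('schema_mismatch', 2)
```

An alert on `oldest_age` catches the case where depth stays small but a message has been sitting unowned for days.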

Replay needs controls. A replayed message should go through idempotent processing, and operators should be able to replay a single message, a filtered group, or a time window.
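
A replay tool with these controls might look like the sketch below. It assumes each message carries a unique `id` that can serve as an idempotency key, and that a set of already-processed IDs is available; `replay`, `selector`, and `processed_ids` are hypothetical names.

```python
from typing import Callable


def replay(
    dead_letters: list[dict],
    handler: Callable[[dict], None],
    processed_ids: set,
    selector: Callable[[dict], bool] = lambda record: True,
) -> dict:
    """Replay selected DLQ records, skipping messages that were already processed."""
    summary = {"replayed": 0, "skipped": 0, "failed": 0}
    for record in dead_letters:
        if record.get("replay_status") == "succeeded" or not selector(record):
            continue
        message_id = record["original_message"]["id"]
        if message_id in processed_ids:
            # Idempotency guard: the side effect already happened once.
            record["replay_status"] = "skipped_duplicate"
            summary["skipped"] += 1
            continue
        try:
            handler(record["original_message"])
        except Exception as exc:
            record["replay_status"] = "failed"
            record["error_reason"] = repr(exc)
            summary["failed"] += 1
        else:
            processed_ids.add(message_id)
            record["replay_status"] = "succeeded"
            summary["replayed"] += 1
    return summary


dlq = [
    {"original_message": {"id": 842}, "error_reason": "schema_mismatch"},
    {"original_message": {"id": 900}, "error_reason": "dependency_timeout"},
]
seen: set = set()
handled: list[dict] = []
result = replay(
    dlq, handled.append, seen,
    selector=lambda r: r["error_reason"] == "schema_mismatch",
)
print(result)  # {'replayed': 1, 'skipped': 0, 'failed': 0}
```

The `selector` covers the single-message and filtered-group cases; a time-window replay would be a selector that compares `first_failed_at` against the window bounds.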

7. Failure Modes

  • DLQ grows silently without alerting.
  • Replaying messages repeats the same failure because the root cause was never fixed.
  • Sensitive data is stored in DLQ payloads without controls.
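  • Messages are dead-lettered too quickly, so transient errors that a later retry would have cleared end up in the DLQ.
  • Messages are retried for too long, so the partition stays blocked while retries run.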
  • Replay duplicates side effects because consumers are not idempotent.
  • Operators fix code but forget to replay affected messages.
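
The two threshold failure modes above (dead-lettering too quickly, retrying too long) come down to simple arithmetic on the retry schedule. As a rough sketch, assuming exponential backoff with a cap, the total time a poison message can block an ordered partition is at least the sum of the delays between attempts. `backoff_delays` and its parameters are hypothetical.

```python
def backoff_delays(
    max_attempts: int,
    base_delay: float,
    factor: float = 2.0,
    cap: float = 60.0,
) -> list[float]:
    """Delays (seconds) slept between attempts; there are max_attempts - 1 of them."""
    return [min(base_delay * factor ** n, cap) for n in range(max_attempts - 1)]


for attempts in (3, 5, 10):
    delays = backoff_delays(attempts, base_delay=1.0)
    print(attempts, sum(delays))
# 3 3.0
# 5 15.0
# 10 243.0
```

With these assumed settings, three attempts give a failing dependency only 3 seconds to recover, while ten attempts hold the partition for about four minutes per poison message, excluding processing time.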

8. Tradeoffs

Benefit                        | Cost
Keeps consumers moving         | Requires operational ownership
Preserves failed messages      | Can hide failures if ignored
Enables replay after fixes     | Replay needs safety controls
Limits poison-message damage   | Retry threshold tuning matters
Improves incident triage       | Stores potentially sensitive payloads

A DLQ is useful only if it is treated as an active repair workflow.
