Dead-Letter Queue
Isolate messages that repeatedly fail processing so they cannot indefinitely block the healthy messages behind them.
Concepts Covered
- Poison messages
- Retry limits
- Failed-event isolation
- Manual replay
- Operational triage
- Consumer progress
- Error metadata
- Replay safety
1. Intent
A Dead-Letter Queue, often shortened to DLQ, stores messages that could not be processed after repeated attempts.
The goal is to keep one bad message from blocking the rest of the stream or queue.
The DLQ does not make the failed message disappear. It moves the message into an explicit operational lane where it can be inspected, fixed, replayed, or intentionally discarded.
2. The Problem Without This Pattern
If a consumer fails on the same message every time, it can retry that message indefinitely and make no progress.
This is common when:
- the payload schema is invalid
- required data is missing
- a downstream dependency rejects the event
- the consumer code has a bug for one edge case
- the event refers to deleted or unavailable state
Without a DLQ, one poison message can stall an entire partition or consumer group.
Example:
consumer reads event 842
event 842 always crashes parser
consumer retries event 842
events 843, 844, 845 are never processed
The failure is now larger than one bad event. It blocks healthy work behind it.
3. How The Pattern Works
Basic flow:
consume message
-> process
-> if success: acknowledge
-> if failure: retry with backoff
-> if retries exhausted: move to DLQ
-> continue processing later messages
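The flow above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a client for any particular broker: `consume`, `parse`, `max_attempts`, and `base_delay` are hypothetical names, appending to `processed` stands in for an acknowledgement, and a plain list stands in for the DLQ.

```python
import time


def consume(messages, handler, dead_letters, max_attempts=3, base_delay=0.5,
            sleep=time.sleep):
    """Process messages in order; dead-letter any that exhaust their retries."""
    processed = []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                processed.append(message)  # stands in for an acknowledgement
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Retries exhausted: isolate the message and move on.
                    dead_letters.append({
                        "original_message": message,
                        "error_reason": repr(exc),
                        "attempt_count": attempt,
                    })
                else:
                    # Exponential backoff before the next attempt.
                    sleep(base_delay * 2 ** (attempt - 1))
    return processed


def parse(event_id):
    """Hypothetical handler that always fails on event 842."""
    if event_id == 842:
        raise ValueError("unparseable payload")


dlq = []
# base_delay=0.0 keeps the example instant; a real consumer would wait.
done = consume([841, 842, 843, 844], parse, dlq, base_delay=0.0)
print(done)      # events processed despite the poison message
print(len(dlq))  # number of dead-lettered messages
```

Event 842 ends up in `dlq` after three attempts, while 843 and 844 are still processed. In a real system the retry limit and backoff would be tuned per consumer.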
The DLQ preserves the failed message plus metadata:
original_message
error_reason
attempt_count
first_failed_at
last_failed_at
consumer_name
trace_id
schema_version
This metadata matters. Operators need to know whether the failure came from bad data, a dependency outage, a code bug, or a schema mismatch.
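One way to keep this metadata consistent is to give the DLQ record an explicit type. The sketch below mirrors the fields listed above; the class name `DeadLetterRecord`, the field types, the JSON encoding, and the sample values are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class DeadLetterRecord:
    original_message: str
    error_reason: str
    attempt_count: int
    first_failed_at: datetime
    last_failed_at: datetime
    consumer_name: str
    trace_id: str
    schema_version: str

    def to_json(self) -> str:
        """Serialize the record, encoding timestamps as ISO 8601 strings."""
        fields = asdict(self)
        fields["first_failed_at"] = self.first_failed_at.isoformat()
        fields["last_failed_at"] = self.last_failed_at.isoformat()
        return json.dumps(fields, sort_keys=True)


# Sample values are invented for the example.
record = DeadLetterRecord(
    original_message='{"event_id": 842}',
    error_reason="ValueError: unparseable payload",
    attempt_count=3,
    first_failed_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    last_failed_at=datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
    consumer_name="projection-builder",
    trace_id="trace-abc123",
    schema_version="2",
)
encoded = record.to_json()
print(encoded)
```

Storing the original message as an opaque string, rather than re-parsing it, means a payload that broke the parser can still be captured verbatim.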
4. When To Use It
Use a DLQ when:
- messages are processed asynchronously
- some messages may be invalid or unprocessable
- retries are useful but should not continue forever
- operators need to inspect and replay failures
- consumer progress matters
- poison messages can block ordered partitions
Good examples:
- analytics event processing
- notification workers
- outbox publishers
- message delivery workers
- projection builders
5. When Not To Use It
A DLQ is not a substitute for fixing consumer bugs.
Avoid treating the DLQ as a trash can. Every DLQ should have ownership, alerting, and a replay policy.
Be careful when:
- failed messages contain sensitive data
- replay can duplicate side effects
- the system cannot safely reprocess old events
- a failure needs immediate human intervention rather than deferred triage
If nobody watches the DLQ, it becomes a silent data loss mechanism with a nicer name.
6. Data And Operational Model
DLQ records should include the failure metadata plus the source position and replay state:
original_message
error_reason
attempt_count
first_failed_at
last_failed_at
consumer_name
trace_id
schema_version
partition
offset
replay_status
Operators should monitor:
- DLQ depth
- DLQ growth rate
- top error reasons
- replay success rate
- age of oldest DLQ message
- sensitive payload exposure risk
- repeated failures after replay
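Several of these signals can be derived directly from the stored records. The sketch below computes depth, the age of the oldest message, and the top error reasons. It assumes records are dictionaries carrying `error_reason` and a timezone-aware `first_failed_at`; the function name `dlq_metrics` and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def dlq_metrics(records, now):
    """Summarize a DLQ snapshot for dashboards or alert rules."""
    if not records:
        return {"depth": 0, "oldest_age_seconds": 0.0, "top_error_reasons": []}
    oldest = min(r["first_failed_at"] for r in records)
    reasons = Counter(r["error_reason"] for r in records)
    return {
        "depth": len(records),
        "oldest_age_seconds": (now - oldest).total_seconds(),
        "top_error_reasons": reasons.most_common(3),
    }


# Sample snapshot, invented for the example.
snapshot = [
    {"error_reason": "schema_mismatch",
     "first_failed_at": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)},
    {"error_reason": "schema_mismatch",
     "first_failed_at": datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)},
    {"error_reason": "missing_field",
     "first_failed_at": datetime(2024, 1, 1, 12, 45, tzinfo=timezone.utc)},
]
metrics = dlq_metrics(snapshot, now=datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc))
print(metrics)
```

Growth rate and replay success rate need history across snapshots, so they are usually computed by the metrics system rather than from a single read of the queue.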
Replay needs controls. A replayed message should go through idempotent processing, and operators should be able to replay a single message, a filtered group, or a time window.
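A minimal sketch of controlled replay follows. A predicate selects which records to replay (one message, a filtered group, or a time window), and a set of already-processed IDs keeps side effects from being duplicated. The `message_id` field, the `replay` function, and the status strings are assumptions for illustration; a partition and offset pair could serve as the ID instead.

```python
def replay(records, handler, processed_ids, should_replay=lambda record: True):
    """Replay matching DLQ records, skipping messages already processed."""
    outcome = {"replayed": [], "skipped": [], "failed": []}
    for record in records:
        if not should_replay(record):
            continue  # outside the selected message, group, or window
        message_id = record["message_id"]
        if message_id in processed_ids:
            # Idempotency guard: the side effect already happened once.
            record["replay_status"] = "skipped_duplicate"
            outcome["skipped"].append(message_id)
            continue
        try:
            handler(record["original_message"])
        except Exception as exc:
            record["replay_status"] = "failed"
            record["error_reason"] = repr(exc)
            outcome["failed"].append(message_id)
        else:
            processed_ids.add(message_id)
            record["replay_status"] = "succeeded"
            outcome["replayed"].append(message_id)
    return outcome


# Sample records, invented for the example.
sent = []
records = [
    {"message_id": "m1", "consumer_name": "notifier", "original_message": "welcome"},
    {"message_id": "m2", "consumer_name": "notifier", "original_message": "receipt"},
    {"message_id": "m3", "consumer_name": "analytics", "original_message": "click"},
]
already_processed = {"m2"}
result = replay(
    records,
    sent.append,
    already_processed,
    should_replay=lambda r: r["consumer_name"] == "notifier",
)
print(result)
```

Here only the `notifier` records are selected, `m2` is skipped because it was already processed, and `m3` is left untouched for a later replay. In production the processed-ID set would live in durable storage shared with the consumer.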
7. Failure Modes
- DLQ grows silently without alerting.
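- Replaying messages repeats the same failure because the root cause was never fixed.
- Sensitive data is stored in DLQ payloads without controls.
- Messages are dead-lettered too quickly, so transient failures such as a brief dependency outage fill the DLQ.
- Messages are retried for too long before being dead-lettered, which blocks progress.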
- Replay duplicates side effects because consumers are not idempotent.
- Operators fix code but forget to replay affected messages.
8. Tradeoffs
| Benefit | Cost |
|---|---|
| Keeps consumers moving | Requires operational ownership |
| Preserves failed messages | Can hide failures if ignored |
| Enables replay after fixes | Replay needs safety controls |
| Limits poison-message damage | Retry threshold tuning matters |
| Improves incident triage | Stores potentially sensitive payloads |
A DLQ is useful only if it is treated as an active repair workflow.