System Design

Dead Letter Handling in Streams

Isolate poison messages so one bad record cannot stall an entire partition forever.

The poison message problem

A consumer reads records in order from a partition. If one record cannot be processed, perhaps because it is malformed or violates a business rule, naive retry logic loops on it forever. Because the partition is ordered, that single poison message blocks every record behind it, and consumer lag grows without bound.

The dead letter queue

The fix is to stop retrying a hopeless record and move it aside. After a bounded number of retries, the consumer publishes the failed record to a dead letter queue (a separate topic), commits the offset past it, and continues with the next record.
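
The loop described above can be sketched in a few lines. This is a minimal illustration, not a client for any particular broker: an in-memory list stands in for the partition, another list stands in for the dead letter topic, and the names `consume`, `MAX_ATTEMPTS`, and `parse_int`, as well as the retry bound of 3, are assumptions made for the example.

```python
from typing import Callable, List, Tuple

MAX_ATTEMPTS = 3  # assumed retry bound, chosen only for illustration


def consume(
    partition: List[str],
    handler: Callable[[str], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[int, List[Tuple[int, str]]]:
    """Process records in order and park persistent failures in a dead letter list.

    Returns the committed offset (the next offset to read) and the dead letters.
    """
    dead_letters: List[Tuple[int, str]] = []
    committed = 0
    for offset, record in enumerate(partition):
        for attempt in range(1, max_attempts + 1):
            try:
                handler(record)
                break  # success: stop retrying this record
            except Exception:
                if attempt == max_attempts:
                    # Retries exhausted: move the record aside instead of looping.
                    dead_letters.append((offset, record))
        # Commit past the record whether it succeeded or was dead-lettered.
        committed = offset + 1
    return committed, dead_letters


def parse_int(record: str) -> None:
    """Hypothetical handler that rejects malformed records."""
    int(record)  # raises ValueError on non-numeric input


committed, dlq = consume(["1", "oops", "3"], parse_int)
print(committed, dlq)  # 3 [(1, 'oops')]
```

The record at offset 1 is set aside after three failed attempts, and the record behind it is still processed, which is the whole point of the pattern.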

What to record

  • Capture the original payload, the error, the topic and offset, and a retry count so a human or job can diagnose and possibly replay it.
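
One possible shape for such a record is sketched below. The `DeadLetter` class, the `to_dead_letter` and `encode` helpers, and the JSON encoding are assumptions for illustration. The `partition` field is also an addition not listed above, included because an offset only identifies a record within one partition.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DeadLetter:
    payload: str      # original record, unmodified
    error: str        # exception type and message
    topic: str        # source topic
    partition: int    # assumed extra field: offsets are per partition
    offset: int       # position of the record in the source partition
    retry_count: int  # attempts made before giving up


def to_dead_letter(
    payload: str,
    exc: Exception,
    topic: str,
    partition: int,
    offset: int,
    retry_count: int,
) -> DeadLetter:
    """Wrap a failed record with the context needed to diagnose or replay it."""
    return DeadLetter(
        payload=payload,
        error=f"{type(exc).__name__}: {exc}",
        topic=topic,
        partition=partition,
        offset=offset,
        retry_count=retry_count,
    )


def encode(dead_letter: DeadLetter) -> str:
    """Serialize the envelope as JSON for publishing to the dead letter topic."""
    return json.dumps(asdict(dead_letter))


try:
    int("oops")
except ValueError as exc:
    letter = to_dead_letter("oops", exc, "orders", 0, 42, 3)

print(encode(letter))
```

Keeping the payload untouched matters for replay: a later job can read the envelope, fix the cause, and resubmit the original bytes to the source topic.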

Retry strategy

  • Bounded retries with backoff handle transient failures like a brief downstream outage.
  • Only persistent failures should reach the dead letter queue, not transient ones.
  • Some designs use a retry topic with a delay before the dead letter queue, separating slow recoverable errors from truly broken records.
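
The routing described in these bullets might look like the sketch below. It assumes the handler signals recoverable failures with a hypothetical `TransientError` class and treats every other exception as permanent. The function names, the backoff parameters, and the string labels for the three outcomes are likewise illustrative choices, not a standard API.

```python
import time
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical marker for failures that may succeed on a later attempt."""


def backoff_delays(
    max_retries: int, base: float = 0.1, factor: float = 2.0, cap: float = 5.0
) -> List[float]:
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(max_retries)]


def process_with_retry(
    record: str,
    handler: Callable[[str], None],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return 'ok', 'retry_topic', or 'dead_letter' for one record."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            handler(record)
            return "ok"
        except TransientError:
            if attempt == max_retries:
                # Still failing after bounded retries, but possibly recoverable:
                # hand off to a delayed retry topic rather than the dead letter queue.
                return "retry_topic"
            sleep(delays[attempt])
        except Exception:
            # Permanent failure: retrying cannot help, so dead-letter immediately.
            return "dead_letter"
    return "retry_topic"  # only reachable if max_retries is negative


def malformed(record: str) -> None:
    """Hypothetical handler that always fails permanently."""
    raise ValueError(f"cannot parse {record!r}")


print(process_with_retry("oops", malformed))  # dead_letter
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. In a design without a retry topic, the `retry_topic` outcome would instead be routed to the dead letter queue once in-process retries are exhausted.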

Key idea

A dead letter queue keeps one poison message from blocking an ordered partition by moving persistently failing records aside after bounded retries, so the consumer can advance and the bad records are kept for later analysis.

Check yourself

1. Why does a poison message stall a partition?

2. What is the role of a dead letter queue?