The poison message problem
A poison message is one that fails every time it is processed, no matter how often it is retried, perhaps because it is malformed or triggers a bug in the handler. Under at-least-once delivery, the broker redelivers it after each failure, so it can loop forever. Worse, if it sits at the head of an ordered queue, it can block every message behind it.
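The blocking effect can be seen in a small simulation that is not tied to any real broker. An in-memory deque stands in for an ordered queue, and a message is removed (acked) only when the handler succeeds. The max_steps parameter stands in for "forever" so the loop terminates. The names naive_consume and handler are hypothetical.

```python
from collections import deque


def naive_consume(queue, handler, max_steps):
    """Simulate an ordered queue under at-least-once delivery with no retry cap.

    A message is acked (removed) only if the handler succeeds. On failure it
    stays at the head and is redelivered on the next step.
    """
    processed = []
    for _ in range(max_steps):
        if not queue:
            break
        msg = queue[0]
        try:
            handler(msg)
        except Exception:
            continue  # no ack: the same head message is redelivered next step
        processed.append(queue.popleft())  # success: ack and advance
    return processed


def handler(msg):
    # Hypothetical handler that can never parse one particular message.
    if msg == "malformed":
        raise ValueError("cannot parse")


queue = deque(["ok-1", "malformed", "ok-2", "ok-3"])
print(naive_consume(queue, handler, max_steps=1000))  # ['ok-1']
print(list(queue))  # ['malformed', 'ok-2', 'ok-3']
```

Only the message ahead of the poison one gets through. Everything behind it waits for as long as the loop runs.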
How to defuse it
- Track a delivery count or retry count per message.
- After a threshold, stop retrying and move the message to a dead letter queue for later inspection.
- Add backoff between retries so that the cause of a transient failure has time to clear without the retries hammering the system.
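The three steps can be sketched together as a consumer loop. This sketch rests on several assumptions. The queue is an in-memory deque rather than a real broker, and the consumer increments a per-message delivery_count itself, although many brokers track this for you. The names Message, consume, backoff_delay, and max_deliveries are hypothetical. Backoff here is exponential and capped. The sleep function is injectable so the example runs instantly.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Message:
    body: str
    delivery_count: int = 0  # hypothetical per-message counter of deliveries so far


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Exponential backoff: base * 2**(attempt - 1) seconds, never more than cap."""
    return min(cap, base * 2 ** (attempt - 1))


def consume(
    queue: Deque[Message],
    handler: Callable[[Message], None],
    dead_letter: List[Message],
    max_deliveries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Message]:
    """Drain an ordered queue, dead lettering any message that fails max_deliveries times."""
    processed: List[Message] = []
    while queue:
        msg = queue[0]
        msg.delivery_count += 1
        try:
            handler(msg)
        except Exception:
            if msg.delivery_count >= max_deliveries:
                # Threshold reached: stop retrying and park the message for inspection.
                dead_letter.append(queue.popleft())
            else:
                # Wait before redelivery so a transient problem has time to clear.
                sleep(backoff_delay(msg.delivery_count))
            continue
        processed.append(queue.popleft())  # success: ack and move on
    return processed


def handler(msg: Message) -> None:
    if msg.body == "poison":
        raise ValueError("always fails")


delays: List[float] = []
dlq: List[Message] = []
q = deque([Message("ok-1"), Message("poison"), Message("ok-2")])
done = consume(q, handler, dlq, sleep=delays.append)
print([m.body for m in done])  # ['ok-1', 'ok-2']
print([(m.body, m.delivery_count) for m in dlq])  # [('poison', 3)]
print(delays)  # [0.1, 0.2]
```

Because this consumer preserves order, it waits during backoff, so the poison message still delays the messages behind it. That delay now lasts only a bounded number of attempts. A consumer that does not need strict ordering could instead requeue the failed message with a delay and keep processing the others.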
Distinguishing causes
Not every failure means poison. A database that is briefly down causes a transient failure that is worth retrying. A schema violation is permanent, so retrying it only wastes deliveries. Good handlers separate the two, retrying transient errors a few times and routing permanent ones straight to the dead letter queue.
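One way to separate the two cases uses TransientError and PermanentError, which are hypothetical exception classes. A real handler would map its database driver's and validator's own exceptions onto them. Permanent failures are dead lettered on the first delivery. Transient or unclassified failures are retried until a delivery cap is reached. The names handle_delivery, make_handler, and max_deliveries are also illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical marker: a failure that may clear on its own, e.g. the database is briefly down."""


class PermanentError(Exception):
    """Hypothetical marker: a failure retrying cannot fix, e.g. a schema violation."""


@dataclass
class Message:
    body: dict
    delivery_count: int = 0


def handle_delivery(
    msg: Message,
    handler: Callable[[Message], None],
    dead_letter: List[Message],
    max_deliveries: int = 3,
) -> str:
    """Process one delivery and return 'ack', 'retry', or 'dead_letter'."""
    msg.delivery_count += 1
    try:
        handler(msg)
    except PermanentError:
        # Permanent: retrying cannot help, so dead letter on the first failure.
        dead_letter.append(msg)
        return "dead_letter"
    except Exception:
        # Transient or unclassified: retry, but still cap the number of deliveries.
        if msg.delivery_count >= max_deliveries:
            dead_letter.append(msg)
            return "dead_letter"
        return "retry"
    return "ack"


def make_handler(db_down_for: int) -> Callable[[Message], None]:
    """Hypothetical handler: rejects bodies without user_id; the database fails the first db_down_for calls."""
    def handler(msg: Message) -> None:
        nonlocal db_down_for
        if "user_id" not in msg.body:
            raise PermanentError("schema violation: missing user_id")
        if db_down_for > 0:
            db_down_for -= 1
            raise TransientError("database unavailable")
    return handler


handler = make_handler(db_down_for=2)
dlq: List[Message] = []
print(handle_delivery(Message({"name": "x"}), handler, dlq))  # dead_letter
good = Message({"user_id": 7})
print([handle_delivery(good, handler, dlq) for _ in range(3)])  # ['retry', 'retry', 'ack']
```

Treating unclassified exceptions as transient, while still capping them, means a misclassified bug costs a few wasted retries rather than an endless loop.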
Key idea
Cap retries with a delivery count and dead letter the message once the cap is reached, so a single poison message cannot loop forever or block the queue.