When you need to replay
Sometimes you must reprocess data you have already consumed:
- A bug corrupted a derived view and you must recompute it correctly.
- A new feature needs a view built from all historical events, not just future ones.
- A schema change means old records need reinterpretation.
Because the log retains records, you can reread them from an earlier offset. This is called a backfill or reprocessing.
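A minimal sketch of rereading from an earlier offset, using a toy in-memory log rather than any real broker client. The `Log` class and its `append` and `read_from` methods are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Toy append-only log; a stand-in for one retained partition."""
    records: list = field(default_factory=list)

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the new record

    def read_from(self, offset):
        """Yield (offset, record) pairs starting at the given offset."""
        for i in range(offset, len(self.records)):
            yield i, self.records[i]


log = Log()
for amount in (5, 7, 11):
    log.append({"amount": amount})

# The consumer has already processed everything, so its position is the log end.
committed = len(log.records)

# Reprocessing: ignore the committed position and reread from an earlier offset.
replayed = [record["amount"] for _, record in log.read_from(0)]
```

Reading from `committed` yields nothing new, while reading from offset 0 yields every retained record again. That is possible only because consuming a record does not delete it.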
Run it on a parallel path
The safest pattern is to leave the live consumer untouched. Instead:
- Start a new consumer group that reads from the earliest retained offset into a new output table.
- Let it catch up to the present while the live pipeline keeps serving.
- Cut over reads to the new table once it is current, then retire the old one.
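The steps above can be sketched with two consumer groups that keep independent offsets over the same log. Everything here is an illustrative in-memory model: the group names, the `balances_v1` and `balances_v2` tables, and the `poll` helper are assumptions, not a real client API.

```python
# (account, delta) events in log order.
log = [("alice", 10), ("bob", 5), ("alice", -3)]

# Each consumer group tracks its own position; 0 is the earliest offset here.
offsets = {"live": len(log), "backfill": 0}

# Assume a bug made the live pipeline drop alice's -3 event in balances_v1.
tables = {"balances_v1": {"alice": 10, "bob": 5}, "balances_v2": {}}
serving = "balances_v1"  # the table that readers currently query


def apply_event(table, event):
    account, delta = event
    table[account] = table.get(account, 0) + delta


def poll(group, table_name, max_records):
    """Process up to max_records for one group and advance its offset."""
    start = offsets[group]
    batch = log[start:start + max_records]
    for event in batch:
        apply_event(tables[table_name], event)
    offsets[group] = start + len(batch)
    return len(batch)


# The live pipeline keeps serving new events while the backfill runs.
log.append(("bob", 2))
poll("live", "balances_v1", max_records=10)

# The backfill group replays the whole log into the new table in small batches.
while offsets["backfill"] < len(log):
    poll("backfill", "balances_v2", max_records=2)

# Cut over only once the new table is current; the old one can then be retired.
serving = "balances_v2"
```

The backfill never writes to `balances_v1`, so a failed or abandoned replay leaves the live path exactly as it was.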
Things that bite
- Downstream side effects: replaying must not resend emails or charge cards, so reprocessing should target idempotent or isolated sinks only.
- Load: a full replay can flood databases and brokers, so throttle it and run off peak.
- Ordering with live writes: if the backfill and live stream write the same table, late historical records can clobber newer state, so prefer separate tables and a clean cutover.
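The ordering hazard is easy to reproduce in a few lines. This is a hedged illustration with made-up keys and values, assuming a last-write-wins table where the most recent write to a key replaces the previous one.

```python
# Historical (key, value) events in log order, followed by one newer live event.
history = [("user-1", "free"), ("user-1", "trial")]
live_event = ("user-1", "paid")

# Hazard: the backfill and the live stream share one last-write-wins table.
shared = {}
shared[live_event[0]] = live_event[1]  # live consumer writes the newest state
for key, value in history:             # backfill arrives later with older records
    shared[key] = value                # and overwrites the newer value

# Safer: the backfill owns a separate table and replays the whole log in order.
rebuilt = {}
for key, value in history + [live_event]:
    rebuilt[key] = value
```

In the shared table `user-1` ends up as `"trial"`, which is stale. In the separately rebuilt table it ends up as `"paid"`, because a single writer applied the events in log order.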
Key idea
Reprocessing replays the retained log into a fresh output on a parallel consumer group, then cuts over, keeping the live pipeline untouched and avoiding duplicate external side effects.