Backfilling and Reprocessing

Replay historical data to fix bugs or build new views without disrupting live consumers.

When you need to replay

Sometimes you must reprocess data you have already consumed:

A bug corrupted a derived view and you must recompute it correctly.
A new feature needs a view built from all historical events, not just future ones.
A schema change means old records need reinterpretation.

Because the log retains records, you can reread them from an earlier offset. This replay is called a backfill or reprocessing.
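As a minimal sketch of the first case above, the following recomputes a corrupted view by rereading a retained log with corrected logic. The in-memory LOG list, the record fields, and the function names are all hypothetical stand-ins for illustration, not a real broker API.

```python
from collections import defaultdict

# Hypothetical retained log: a record's offset is its index in this list.
LOG = [
    {"user": "a", "amount_cents": 500},
    {"user": "b", "amount_cents": 250},
    {"user": "a", "amount_cents": -100},  # a refund
]


def build_totals(log, start_offset, apply):
    """Reread the log from start_offset and fold each record into a fresh view."""
    view = defaultdict(int)
    for offset in range(start_offset, len(log)):
        apply(view, log[offset])
    return dict(view)


def buggy_apply(view, record):
    # Bug: refunds are silently dropped instead of subtracted.
    if record["amount_cents"] > 0:
        view[record["user"]] += record["amount_cents"]


def fixed_apply(view, record):
    view[record["user"]] += record["amount_cents"]


corrupted = build_totals(LOG, 0, buggy_apply)   # what the buggy consumer produced
recomputed = build_totals(LOG, 0, fixed_apply)  # replay from offset 0 with the fix
```

The fix is possible only because the refund record is still in the log. Had it been discarded after the first read, the corrected logic would have nothing to replay.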

Run it on a parallel path

The safest pattern is to leave the live consumer untouched. Instead:

Start a new consumer group that reads from the earliest retained offset into a new output table.
Let it catch up to the present while the live pipeline keeps serving.
Cut over reads to the new table once it is current, then retire the old one.
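The three steps above can be sketched with a toy in-memory log that tracks offsets per consumer group. The Log class, the group names live and backfill-v2, the table names, and the serving pointer are assumptions made for this example. A real broker and database would replace them.

```python
class Log:
    """Hypothetical in-memory log; a record's offset is its list index."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = {}  # consumer group -> next offset to read

    def poll(self, group, max_records=2):
        start = self.committed.get(group, 0)  # a new group starts at the earliest offset
        batch = self.records[start:start + max_records]
        self.committed[group] = start + len(batch)
        return batch

    def lag(self, group):
        return len(self.records) - self.committed.get(group, 0)


def drain(log, group, table):
    """Consume until the group has no lag, upserting each record by key."""
    while log.lag(group) > 0:
        for key, value in log.poll(group):
            table[key] = value


log = Log([("k1", 1), ("k2", 2), ("k1", 3)])
tables = {"view_v1": {}, "view_v2": {}}
serving = {"current": "view_v1"}  # readers look up the table name here

drain(log, "live", tables["view_v1"])  # the live pipeline is already up to date

# Steps 1 and 2: a separate group backfills view_v2 while live keeps running.
drain(log, "backfill-v2", tables["view_v2"])
log.records.append(("k2", 4))                 # a new event arrives mid-backfill
drain(log, "live", tables["view_v1"])         # live offsets are unaffected
drain(log, "backfill-v2", tables["view_v2"])  # backfill catches up to the present

# Step 3: cut over only once the new table is current, then retire the old one.
if log.lag("backfill-v2") == 0:
    serving["current"] = "view_v2"
    tables.pop("view_v1")
```

Because each group keeps its own committed offset, the backfill never moves the live group's position, and readers switch tables through a single pointer update.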

Things that bite

Downstream side effects: replaying must not resend emails or charge cards, so reprocessing should target idempotent or isolated sinks only.
Load: a full replay can flood databases and brokers, so throttle it and run off peak.
Ordering with live writes: if the backfill and live stream write the same table, late historical records can clobber newer state, so prefer separate tables and a clean cutover.
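The first two hazards can be illustrated with a short sketch: a replay loop that writes through an idempotent upsert, skips the external side effect, and paces itself. EmailSink, the replaying flag, and max_per_second are hypothetical names, and the fixed pause per record is a deliberately crude throttle rather than a recommended rate limiter.

```python
import time


class EmailSink:
    """Hypothetical external sink; sends are skipped while replaying."""

    def __init__(self, replaying):
        self.replaying = replaying
        self.sent = []

    def send(self, address):
        if self.replaying:
            return  # isolated: the backfill must not notify anyone again
        self.sent.append(address)


def replay(records, table, sink, max_per_second, sleep=time.sleep):
    """Apply records to table at a bounded rate."""
    interval = 1.0 / max_per_second
    for key, value in records:
        table[key] = value  # idempotent upsert: a rerun yields the same table
        sink.send(key)
        sleep(interval)     # fixed pause after every record


history = [("a@example.com", "welcomed"), ("b@example.com", "welcomed")]
table = {}
sink = EmailSink(replaying=True)
pauses = []  # record the pauses instead of sleeping, to keep the example fast

replay(history, table, sink, max_per_second=100, sleep=pauses.append)
replay(history, table, sink, max_per_second=100, sleep=pauses.append)  # rerun is harmless
```

Running the replay twice leaves the table unchanged and sends nothing, which is the property to look for in any sink a backfill writes to.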

Key idea

Reprocessing replays the retained log into a fresh output on a parallel consumer group, then cuts over, keeping the live pipeline untouched and avoiding duplicate external side effects.
