System Design

Backfilling and Reprocessing

Replay historical data to fix bugs or build new views without disrupting live consumers.

When you need to replay

Sometimes you must reprocess data that you have already consumed:

  • A bug corrupted a derived view and you must recompute it correctly.
  • A new feature needs a view built from all historical events, not just future ones.
  • A schema change means old records need reinterpretation.

Because the log retains records, you can reread from an earlier offset, as long as that offset is still within the retention window. This replay is called a backfill or reprocessing.
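
A minimal sketch of the idea, assuming a hypothetical in-memory list stands in for the retained log (the list index plays the role of the offset). The names build_view, buggy_apply and fixed_apply are illustrative only and do not come from any real client library. The same records are read twice from offset 0: once with the buggy logic that corrupted the view, and once with the fix.

```python
# Hypothetical stand-in for a retained log: the list index is the offset.
log = [("alice", 5), ("bob", 3), ("alice", 2), ("bob", 4)]


def build_view(records, start_offset, apply):
    """Read records from start_offset to the end and fold them into a new view."""
    view = {}
    for offset in range(start_offset, len(records)):
        key, amount = records[offset]
        apply(view, key, amount)
    return view


def buggy_apply(view, key, amount):
    # Bug: overwrites the running total instead of adding to it.
    view[key] = amount


def fixed_apply(view, key, amount):
    # Fix: accumulate a per-key total.
    view[key] = view.get(key, 0) + amount


# The original run produced a corrupted view.
corrupted = build_view(log, 0, buggy_apply)

# Because the records are still in the log, rereading from offset 0
# with the corrected logic rebuilds the view.
repaired = build_view(log, 0, fixed_apply)
```

The repair works only because every record needed is still retained. If older records had already expired from the log, the rebuilt totals would be incomplete.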

Run it on a parallel path

The safest pattern leaves the live consumer untouched. Instead:

  • Start a new consumer group that reads from the earliest retained offset into a new output table.
  • Let it catch up to the present while the live pipeline keeps serving.
  • Cut over reads to the new table once it is current, then retire the old one.
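
The steps above can be sketched as follows. This is a toy model rather than a real broker client: the Log class, the group names "live" and "backfill", the table names totals_v1 and totals_v2, and the serving variable are all hypothetical. It shows that each consumer group keeps its own offset, so the backfill can start from the beginning without moving the live group's position.

```python
class Log:
    """Toy append-only log that tracks one read offset per consumer group."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer group name -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, group, max_records):
        # A group that has never read before starts at offset 0.
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def lag(self, group):
        # Number of records the group has not yet read.
        return len(self.records) - self.offsets.get(group, 0)


def apply_v1(table, record):
    key, amount = record
    table[key] = amount  # old (buggy) logic: overwrites


def apply_v2(table, record):
    key, amount = record
    table[key] = table.get(key, 0) + amount  # corrected logic: accumulates


log = Log()
for record in [("alice", 5), ("bob", 3), ("alice", 2)]:
    log.append(record)

tables = {"totals_v1": {}, "totals_v2": {}}
serving = "totals_v1"  # readers are pointed at the old table

# The live group keeps consuming into the old table.
for record in log.poll("live", 100):
    apply_v1(tables["totals_v1"], record)

# New traffic arrives while the backfill is being prepared.
log.append(("bob", 4))
for record in log.poll("live", 100):
    apply_v1(tables["totals_v1"], record)

# The backfill group starts from offset 0 and writes only to the new table.
while log.lag("backfill") > 0:
    for record in log.poll("backfill", 2):
        apply_v2(tables["totals_v2"], record)

# Cut over reads once the backfill has caught up to the present.
if log.lag("backfill") == 0:
    serving = "totals_v2"
```

In a real system the cutover check and the pointer swap would need to be coordinated with ongoing writes. Here they are reduced to a single lag check for clarity.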

Things that bite

  • Downstream side effects: a replay must not resend emails or charge cards a second time, so point reprocessing only at idempotent or isolated sinks.
  • Load: a full replay can flood databases and brokers, so throttle it and run off peak.
  • Ordering with live writes: if the backfill and live stream write the same table, late historical records can clobber newer state, so prefer separate tables and a clean cutover.
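
A rough sketch of one guard per pitfall, under stated assumptions. EmailSink, guarded_upsert and throttled are hypothetical helpers, and the event tuples (event id, key, value, version) are an invented shape. The version guard is shown only as a possible mitigation for cases where a shared table cannot be avoided. Separate tables with a clean cutover remain the preferred approach.

```python
import time


class EmailSink:
    """Hypothetical sink that remembers which event ids it has already handled."""

    def __init__(self):
        self.sent_ids = set()
        self.outbox = []

    def send_once(self, event_id, message):
        # Idempotent: a replayed event id is ignored instead of being sent again.
        if event_id in self.sent_ids:
            return False
        self.sent_ids.add(event_id)
        self.outbox.append(message)
        return True


def guarded_upsert(table, key, value, version):
    """Write only if the incoming version is newer than the stored one."""
    current = table.get(key)
    if current is not None and current[1] >= version:
        return False  # an older or duplicate record must not clobber newer state
    table[key] = (value, version)
    return True


def throttled(records, max_per_second, sleep=time.sleep):
    """Yield records at a bounded rate; sleep is injectable so tests need not wait."""
    interval = 1.0 / max_per_second
    for record in records:
        yield record
        sleep(interval)


events = [(1, "user-1", "welcome", 1), (2, "user-1", "receipt", 2)]
sink = EmailSink()
table = {}
waits = []  # records the pauses instead of actually sleeping

# The second pass simulates a replay of the same events.
for _ in range(2):
    for event_id, key, value, version in throttled(events, 100, sleep=waits.append):
        sink.send_once(event_id, value)
        guarded_upsert(table, key, value, version)
```

After the replay, the outbox still holds each message once, and the table still holds the version 2 value, even though the version 1 record was processed again afterwards.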

Key idea

Reprocessing replays the retained log into a fresh output on a parallel consumer group, then cuts over, keeping the live pipeline untouched and avoiding duplicate external side effects.

Check yourself

1. Why run a backfill on a parallel consumer group and new table?

2. What external risk must reprocessing avoid?