System Design

Backfill and Reprocessing

Recomputing historical data safely after a bug fix or new logic.

When you must rebuild history

A backfill recomputes past data, for example after fixing a transform bug or adding a new column. Reprocessing is the general act of rerunning historical windows. Doing this safely on production data is hard: a careless rerun can duplicate rows, collide with the live pipeline, or take compute away from live jobs.

Core requirements

  • Idempotency so each historical window can be overwritten without duplicating rows.
  • Partitioned data so you can target specific date ranges instead of the whole table.
  • Decoupled runs so the backfill does not collide with the live daily pipeline writing the same tables.
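
As a minimal sketch of the first two requirements, the example below models a partitioned table as an in-memory dict keyed by date and replaces one partition wholesale instead of appending to it. The `Table` type, the `transform` function, and the sample rows are all hypothetical stand-ins, not part of any real pipeline.

```python
from datetime import date

# Hypothetical in-memory "table": partition key (a date) -> list of rows.
Table = dict[date, list[dict]]


def transform(raw_rows: list[dict]) -> list[dict]:
    """Hypothetical corrected transform: convert cents to dollars."""
    return [
        {"order_id": r["order_id"], "amount": r["amount_cents"] / 100}
        for r in raw_rows
    ]


def backfill_partition(raw: Table, target: Table, day: date) -> None:
    """Recompute one partition and replace it wholesale (overwrite, never append)."""
    target[day] = transform(raw.get(day, []))


raw: Table = {
    date(2024, 1, 1): [{"order_id": 1, "amount_cents": 250}],
    date(2024, 1, 2): [{"order_id": 2, "amount_cents": 400}],
}
target: Table = {}

backfill_partition(raw, target, date(2024, 1, 1))
backfill_partition(raw, target, date(2024, 1, 1))  # rerun of the same window
```

Because the write is an overwrite of a single partition, running the same window twice leaves one copy of each row, and partitions outside the targeted date are never touched.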

Strategies

  • Shadow tables: build the corrected data in a parallel table, validate it, then swap it in atomically.
  • Window by window: process old partitions in chunks to bound resource use and allow checkpointing.
  • Throttling: cap parallelism so a large backfill does not starve live jobs of compute.
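
The window-by-window and throttling strategies can be sketched together. The example below is an illustration under assumptions: `date_windows`, `run_backfill`, and the in-memory `done` set used as a checkpoint are hypothetical names, and a thread pool stands in for whatever scheduler actually runs the jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def date_windows(start: date, end: date, days_per_window: int):
    """Yield (window_start, window_end) pairs covering [start, end] inclusive."""
    cur = start
    while cur <= end:
        window_end = min(cur + timedelta(days=days_per_window - 1), end)
        yield (cur, window_end)
        cur = window_end + timedelta(days=1)


def run_backfill(windows, process_window, done: set, max_workers: int = 2) -> None:
    """Process windows not yet in `done`, at most `max_workers` at a time."""
    pending = [w for w in windows if w not in done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order and re-raises a worker's
        # exception when its result is reached, so a window is recorded in
        # `done` only after it finished without error.
        for window, _ in zip(pending, pool.map(process_window, pending)):
            done.add(window)


processed = []  # records which windows were actually run
done = set()    # hypothetical checkpoint; a real job would persist this
windows = list(date_windows(date(2024, 1, 1), date(2024, 1, 10), days_per_window=4))

run_backfill(windows, processed.append, done)
run_backfill(windows, processed.append, done)  # resume: nothing left to do
```

The `max_workers` cap is the throttle, and the checkpoint lets an interrupted backfill resume without redoing finished windows. Combined with idempotent writes, a window that failed partway can simply be run again.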

Watch outs

Be careful with late-arriving dimensions and with code that depends on values that have since changed: a naive rerun may silently produce numbers that differ from the original run.
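
One possible way to limit this drift, offered here as an assumption and not as the lesson's prescribed fix, is to look up dimension values as of the event date and not as of today. The sketch below uses a hypothetical effective-dated history for a single customer; `tier_history` and `tier_as_of` are illustrative names.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical effective-dated history for one customer, sorted by valid_from.
tier_history = [
    (date(2023, 1, 1), "basic"),
    (date(2024, 3, 1), "premium"),
]


def tier_as_of(history, day):
    """Return the tier in effect on `day`, or None if `day` precedes all records."""
    starts = [valid_from for valid_from, _ in history]
    idx = bisect_right(starts, day) - 1
    return history[idx][1] if idx >= 0 else None


event_day = date(2024, 1, 15)
naive_tier = tier_history[-1][1]                         # value as of today
point_in_time_tier = tier_as_of(tier_history, event_day)  # value on the event date
```

Here a naive rerun would label the January event with the current tier, while the point-in-time lookup returns the tier that applied when the event happened.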

Key idea

Safe backfills need idempotent, partitioned jobs run window by window, often into a validated shadow table that is swapped in atomically.
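
The shadow-table swap can be sketched with Python's built-in `sqlite3` module. The table names (`orders`, `orders_shadow`, `orders_old`), the sample row, and the row-count check are assumptions for illustration; a real validation step would likely compare more than counts, and other databases have their own swap mechanisms.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction around the swap is controlled explicitly with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 250.0)")  # buggy: cents stored as dollars

# 1. Build the corrected data in a parallel shadow table.
conn.execute("CREATE TABLE orders_shadow (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders_shadow SELECT order_id, amount / 100 FROM orders")

# 2. Validate before touching the live table (a row-count check as a minimum).
live_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
shadow_count = conn.execute("SELECT COUNT(*) FROM orders_shadow").fetchone()[0]
if live_count != shadow_count:
    raise RuntimeError("row counts differ; refusing to swap")

# 3. Swap inside one transaction so readers see either the old or the new table.
conn.execute("BEGIN")
conn.execute("ALTER TABLE orders RENAME TO orders_old")
conn.execute("ALTER TABLE orders_shadow RENAME TO orders")
conn.execute("COMMIT")

fixed_amount = conn.execute("SELECT amount FROM orders").fetchone()[0]
```

Keeping `orders_old` around after the swap gives a cheap rollback path if a problem surfaces later.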

Check yourself

1. What property is essential for safe backfilling?

2. Why build a backfill in a shadow table first?

3. Why throttle a large backfill?