Backfill and Reprocessing

When you must rebuild history

A backfill recomputes past data, for example after fixing a transform bug or adding a new column. Reprocessing is the more general act of rerunning historical windows. Doing either safely on production data is hard: a rerun can duplicate rows, collide with live writes, or compete with live jobs for compute.

Core requirements

Idempotency, so each historical window can be overwritten without duplicating rows.
Partitioned data, so you can target specific date ranges instead of the whole table.
Decoupled runs, so the backfill does not collide with the live daily pipeline writing to the same tables.
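
The first two requirements can be sketched together as a delete-then-insert on a single date partition inside one transaction. This is a minimal illustration using Python's built-in sqlite3 module; the daily_sales table, its columns, and the overwrite_partition helper are hypothetical names chosen for the example, not part of any particular pipeline.

```python
import sqlite3

def overwrite_partition(conn, day, rows):
    """Replace every row of one date partition in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, store, amount) VALUES (?, ?, ?)",
            [(day, store, amount) for store, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, store TEXT, amount REAL)")

# Original run, then a rerun of the same day after a (hypothetical) transform fix.
overwrite_partition(conn, "2024-01-01", [("a", 10.0), ("b", 5.0)])
overwrite_partition(conn, "2024-01-01", [("a", 12.0), ("b", 5.0)])

count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_sales WHERE day = ?",
    ("2024-01-01",),
).fetchone()
```

Because the rerun deletes the partition before writing it, the day still holds two rows afterwards rather than four. Warehouses with native partition overwrite would likely use that instead of a row-level delete.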

Strategies

Shadow tables: build the corrected data in a parallel table, validate it, then swap it in atomically.
Window by window: process old partitions in chunks to bound resource use and allow checkpointing.
Throttling: cap parallelism so a large backfill does not starve live jobs of compute.
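
Windowing, checkpointing, and throttling can be combined in a small driver. The sketch below is one possible shape, not a prescribed design: windows, run_backfill, and process_window are hypothetical names, the checkpoint is an in-memory set that a real job would persist, and max_workers is the throttle.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def windows(start, end, days_per_chunk):
    """Yield (chunk_start, chunk_end) pairs covering [start, end], inclusive."""
    cur = start
    while cur <= end:
        chunk_end = min(cur + timedelta(days=days_per_chunk - 1), end)
        yield cur, chunk_end
        cur = chunk_end + timedelta(days=1)

def run_backfill(start, end, process_window, done, days_per_chunk=7, max_workers=2):
    """Run pending windows with bounded parallelism; record each success in `done`."""
    pending = [w for w in windows(start, end, days_per_chunk) if w not in done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda w: process_window(*w), pending)
        # map yields results in submission order and re-raises the first failure,
        # so only windows that finished before that failure are checkpointed.
        for window, _ in zip(pending, results):
            done.add(window)
    return done

processed = []

def process_window(chunk_start, chunk_end):
    # Stand-in for the real per-window job (assumed to be idempotent).
    processed.append((chunk_start, chunk_end))

# Assume the first week was completed by an earlier, interrupted run.
done = {(date(2024, 1, 1), date(2024, 1, 7))}
run_backfill(date(2024, 1, 1), date(2024, 1, 20), process_window, done)
```

On restart, only the windows missing from the checkpoint are processed again, which is safe only if each window's job is idempotent.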

Watch outs

Be careful with late-arriving dimensions and with code that depends on values that have since changed: a naive rerun may silently produce different numbers than the original run.
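
The changed-values problem can be illustrated with a dimension that keeps its history. In this small sketch, region_history and region_as_of are hypothetical: the history is a list of (effective_from, region) pairs for one customer, and the lookup returns the value that was in effect on the event date rather than the current one.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history of one customer's region, sorted by effective date.
region_history = [
    (date(2023, 1, 1), "EMEA"),
    (date(2024, 3, 1), "APAC"),
]

def region_as_of(history, day):
    """Return the region in effect on `day`, or None if `day` precedes all records."""
    idx = bisect_right([effective for effective, _ in history], day) - 1
    return history[idx][1] if idx >= 0 else None

event_day = date(2023, 6, 15)
naive = region_history[-1][1]                     # current value, as a naive rerun sees it
as_of = region_as_of(region_history, event_day)   # value on the event date
```

Here the naive rerun would attribute the 2023 event to APAC, while the original run would have seen EMEA. A backfill that is meant to reproduce the original numbers needs the as-of lookup, which in turn requires the dimension's history to have been kept.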

Key idea

Safe backfills need idempotent partitioned jobs run window by window, often into a validated shadow table swapped in atomically.
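
The validate-then-swap step can be sketched with sqlite3, which allows table renames inside a transaction. Everything named here is an assumption for illustration: the sales and sales_shadow tables, the swap_in_shadow helper, and the row-count check, which is a deliberately weak stand-in for real validation. Other databases offer different swap mechanisms.

```python
import sqlite3

def swap_in_shadow(conn, live, shadow, validate):
    """Validate the shadow table, then swap it with the live table in one transaction.

    Table names are interpolated into SQL, so they must be trusted identifiers.
    """
    if not validate(conn, shadow):
        raise ValueError(f"validation failed for {shadow}")
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(f"ALTER TABLE {live} RENAME TO {live}_old")
        conn.execute(f"ALTER TABLE {shadow} RENAME TO {live}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # neither rename takes effect
        raise

# isolation_level=None disables implicit transactions so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('2024-01-01', 10.0)")
conn.execute("CREATE TABLE sales_shadow (day TEXT, amount REAL)")
conn.execute("INSERT INTO sales_shadow VALUES ('2024-01-01', 12.0)")

def same_row_count(conn, shadow):
    # Hypothetical check: the corrected table has as many rows as the live one.
    live_n = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    shadow_n = conn.execute(f"SELECT COUNT(*) FROM {shadow}").fetchone()[0]
    return live_n == shadow_n

swap_in_shadow(conn, "sales", "sales_shadow", same_row_count)
amount = conn.execute("SELECT amount FROM sales").fetchone()[0]
old_amount = conn.execute("SELECT amount FROM sales_old").fetchone()[0]
```

Keeping the previous table as sales_old leaves a quick rollback path until the corrected data has been confirmed downstream.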
