When you must rebuild history
A backfill recomputes past data, for example after fixing a transform bug or adding a new column. Reprocessing is the more general act of rerunning historical windows. Doing either safely on production data is hard, because the rerun must replace old results without duplicating them and without disrupting the live pipeline.
Core requirements
- Idempotency so each historical window can be overwritten without duplicating rows, no matter how many times it is rerun.
- Partitioned data so you can target specific date ranges instead of the whole table.
- Decoupled runs so the backfill does not collide with the live daily pipeline writing the same tables.
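As a rough illustration of the first two requirements, the sketch below overwrites one date partition by deleting and reinserting it inside a single transaction. The table daily_sales, its columns, and the helper overwrite_partition are hypothetical names chosen for the example, and SQLite stands in for whatever warehouse is actually in use.

```python
import sqlite3

def overwrite_partition(conn, day, rows):
    """Replace all rows for one date partition in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, store, amount) VALUES (?, ?, ?)",
            [(day, store, amount) for store, amount in rows],
        )

# Hypothetical table; "day" plays the role of the partition key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, store TEXT, amount REAL)")

overwrite_partition(conn, "2024-01-01", [("a", 10.0), ("b", 5.0)])
# Rerunning the same window with corrected data replaces it instead of appending.
overwrite_partition(conn, "2024-01-01", [("a", 12.0), ("b", 5.0)])

row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
```

Because the delete and insert target only one day, the rerun touches a single partition and leaves the rest of the table alone.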
Strategies
- Shadow tables: build the corrected data in a parallel table, validate it, then swap it in atomically.
- Window by window: process old partitions in chunks to bound resource use and allow checkpointing.
- Throttling: cap parallelism so a large backfill does not starve live jobs of compute.
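The window-by-window and throttling strategies can be sketched together as below. This is a minimal example under assumptions: date_windows, run_backfill, and the in-memory done set are hypothetical, and a thread pool with a small max_workers stands in for whatever scheduler or cluster limit actually caps parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def date_windows(start, end, days_per_window):
    """Yield (window_start, window_end) pairs covering start..end inclusive."""
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days_per_window - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)

def run_backfill(start, end, process_window, done, days_per_window=7, max_workers=2):
    """Process windows not yet in `done`, at most `max_workers` at a time."""
    pending = [w for w in date_windows(start, end, days_per_window) if w not in done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda w: process_window(*w), pending)
        for window, _ in zip(pending, results):
            # Checkpoint only after the window finished without raising.
            done.add(window)
    return pending

windows = list(date_windows(date(2024, 1, 1), date(2024, 1, 20), 7))

done = set()   # in-memory checkpoint; a real job would persist this somewhere
seen = []      # records which windows the stand-in job was asked to process
run_backfill(date(2024, 1, 1), date(2024, 1, 20), lambda s, e: seen.append((s, e)), done)
# A second run finds every window checkpointed and has nothing left to do.
rerun = run_backfill(date(2024, 1, 1), date(2024, 1, 20), lambda s, e: seen.append((s, e)), done)
```

If a window fails partway through, some later windows may have finished without being checkpointed. That is acceptable only because each window is idempotent, so reprocessing it on the next run is harmless.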
Watch outs
Be careful with late-arriving dimensions and with code that depends on values that have since changed. A naive rerun reads today's values rather than the ones in effect at the time, so it can silently produce different numbers than the original run.
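One possible mitigation, assuming the history of the dimension is actually kept, is to look values up as of the date being reprocessed. The sketch below is illustrative only: PRICE_HISTORY and price_as_of are hypothetical, and the prices are made-up example data.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical price history for one product: (effective_from, price), sorted by date.
PRICE_HISTORY = [(date(2024, 1, 1), 10.0), (date(2024, 3, 1), 12.0)]

def price_as_of(history, day):
    """Return the value whose effective_from is the latest one on or before `day`."""
    idx = bisect_right([effective for effective, _ in history], day) - 1
    if idx < 0:
        raise LookupError(f"no value effective on {day}")
    return history[idx][1]

# A naive rerun of a February window would pick up the current price...
current_price = PRICE_HISTORY[-1][1]
# ...while an as-of lookup recovers the price in effect on that day.
original_price = price_as_of(PRICE_HISTORY, date(2024, 2, 15))
```

Here the two values differ, which is exactly the kind of silent drift a naive rerun introduces.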
Key idea
Safe backfills need idempotent partitioned jobs run window by window, often into a validated shadow table swapped in atomically.
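The validate-then-swap step can be sketched as follows. This is an assumption-laden example: swap_in_shadow and the table names are hypothetical, the row-count comparison is only a placeholder for real validation, and SQLite's transactional ALTER TABLE stands in for whatever atomic rename or swap the actual warehouse offers.

```python
import sqlite3

def swap_in_shadow(conn, live, shadow):
    """Validate the shadow table, then swap it in within one transaction.

    Table names are interpolated directly, so they must be trusted constants.
    """
    live_rows = conn.execute(f"SELECT COUNT(*) FROM {live}").fetchone()[0]
    shadow_rows = conn.execute(f"SELECT COUNT(*) FROM {shadow}").fetchone()[0]
    # Placeholder validation rule: never swap in a table that lost rows.
    if shadow_rows < live_rows:
        raise ValueError("shadow table has fewer rows than live; refusing to swap")
    conn.execute("BEGIN")
    try:
        conn.execute(f"ALTER TABLE {live} RENAME TO {live}_old")
        conn.execute(f"ALTER TABLE {shadow} RENAME TO {live}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT above fully controls the swap transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")
conn.execute("INSERT INTO daily_sales VALUES ('2024-01-01', 10.0)")
conn.execute("CREATE TABLE daily_sales_shadow (day TEXT, amount REAL)")
conn.execute("INSERT INTO daily_sales_shadow VALUES ('2024-01-01', 12.0)")

swap_in_shadow(conn, "daily_sales", "daily_sales_shadow")
live_amount = conn.execute("SELECT amount FROM daily_sales").fetchone()[0]
```

Keeping the previous table under an _old suffix is a design choice for this sketch: it leaves a quick way to roll back if the corrected data turns out to be wrong.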