Why backfill
When you fix a transformation bug, add a column, or recover from an outage, you must reprocess past data. A backfill reruns the pipeline over historical partitions. Done carelessly, it duplicates rows, overloads systems, or serves half-corrected data to users.
Principles for safety
- Idempotent partitions are the foundation. Each backfilled date must overwrite its partition, so rerunning replaces rather than appends.
- Bounded parallelism processes a limited window of past dates at a time, capping concurrency so the backfill does not flood the warehouse or the source system and live workloads stay protected.
- Shadow-then-swap writes corrected output to a side location, validates it, then atomically swaps it in, so consumers never see partial results.
- Progress watermarks record which partitions have completed, so a failed backfill can resume from the last completed partition rather than starting over.
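The first, second, and fourth principles can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: `WAREHOUSE` is an in-memory stand-in for a partitioned table, `transform` is a placeholder for the corrected logic, and the names `overwrite_partition`, `load_done`, and `backfill` are hypothetical. Because parallel workers finish out of order, the sketch records the set of completed partitions rather than a single high-water mark.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from pathlib import Path
from threading import Lock

# Hypothetical in-memory stand-in for a date-partitioned warehouse table.
WAREHOUSE: dict[str, list[dict]] = {}
_lock = Lock()


def transform(day: str) -> list[dict]:
    """Placeholder for the corrected transformation (assumed deterministic)."""
    return [{"day": day, "value": len(day)}]


def overwrite_partition(day: str) -> None:
    """Idempotent write: replace the whole partition, never append to it."""
    rows = transform(day)
    with _lock:
        WAREHOUSE[day] = rows


def load_done(progress: Path) -> set[str]:
    """Read the set of completed partitions, or an empty set on first run."""
    return set(json.loads(progress.read_text())) if progress.exists() else set()


def backfill(start: date, end: date, progress: Path, max_workers: int = 4) -> list[str]:
    """Backfill [start, end] inclusive, skipping partitions already done."""
    done = load_done(progress)
    span = (end - start).days + 1
    days = [(start + timedelta(days=i)).isoformat() for i in range(span)]
    pending = [d for d in days if d not in done]

    def run(day: str) -> str:
        overwrite_partition(day)
        with _lock:
            done.add(day)
            # Write to a temp file and rename so a crash cannot leave
            # a truncated progress file behind.
            tmp = progress.with_suffix(".tmp")
            tmp.write_text(json.dumps(sorted(done)))
            tmp.replace(progress)
        return day

    # max_workers caps concurrency so live workloads are not starved.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, pending))
```

Calling `backfill` a second time with the same range and progress file does no work, and rerunning any single partition leaves exactly one copy of its rows. In a real warehouse the dictionary assignment would correspond to whatever partition-overwrite mechanism the engine provides.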
Validate before exposing
Before promoting the backfilled output, compare it against explicit expectations (for example row counts, null rates, and key uniqueness) and against the old data. Only swap once the new history passes its quality gates.
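One way to sketch the validate-then-promote step is with a pointer file that names the live version. The layout here is an assumption made for illustration: each version lives in its own directory with a `data.json` file, a `CURRENT` file names the version consumers read, and `validate` and `promote` are hypothetical helpers. The specific checks and the 10% drift tolerance are examples, not recommended thresholds.

```python
import json
import os
from pathlib import Path


def validate(new_rows: list[dict], old_rows: list[dict], max_drift: float = 0.1) -> list[str]:
    """Return the failed checks; an empty list means the gates passed."""
    failures = []
    if not new_rows:
        failures.append("shadow output is empty")
    if any(r.get("value") is None for r in new_rows):
        failures.append("null in required column 'value'")
    if old_rows and abs(len(new_rows) - len(old_rows)) > max_drift * len(old_rows):
        failures.append("row count drifted beyond tolerance")
    return failures


def promote(root: Path, shadow_version: str, old_rows: list[dict]) -> bool:
    """Point CURRENT at the shadow version only if validation passes."""
    new_rows = json.loads((root / shadow_version / "data.json").read_text())
    if validate(new_rows, old_rows):
        return False  # gates failed: leave the live pointer untouched
    tmp = root / "CURRENT.tmp"
    tmp.write_text(shadow_version)
    # os.replace is an atomic rename when both paths are on the same
    # filesystem, so readers see either the old or the new version name.
    os.replace(tmp, root / "CURRENT")
    return True
```

The shadow data is fully written before the pointer moves, so a consumer that reads `CURRENT` never lands on a half-written version. Warehouses and table formats often offer their own equivalent, such as a view redefinition or a metadata commit, which would take the place of the file rename.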
Key idea
Safe backfills overwrite idempotent partitions with bounded parallelism, write to a validated shadow location, then atomically swap, so live consumers never see partial or duplicated data.