System Design

Backfilling Pipelines Safely

Reprocessing historical data to fix bugs or fill gaps without breaking live consumers.

5 min read · advanced

Why backfill

When you fix a transformation bug, add a column, or recover from an outage, you must reprocess past data. A backfill reruns the pipeline over historical partitions. Done carelessly, it duplicates rows, overloads systems, or serves half-corrected data to users.

Principles for safety

  • Idempotent partitions are the foundation. Each backfilled date must overwrite its partition, so rerunning replaces rather than appends.
  • Bounded parallelism processes only a limited window of past dates at a time, throttling the backfill so it does not flood the warehouse or source and starve live workloads.
  • Shadow then swap writes corrected output to a side location, validates it, then atomically swaps it in, so consumers never see partial results.
  • Watermark the progress so a failed backfill can resume from the last completed partition rather than starting over.
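
The first, second, and fourth principles can be sketched together in a few lines. This is a minimal illustration, not a production design: `warehouse`, `completed`, `transform`, and `run_backfill` are hypothetical names, the "warehouse" is just an in-memory dict keyed by date, and progress is tracked as a set of completed partitions rather than a single watermark, because partitions can finish out of order when run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
import threading

# Hypothetical in-memory stand-ins for a partitioned table and a progress log.
warehouse: dict[date, list[dict]] = {}
completed: set[date] = set()
_lock = threading.Lock()


def transform(day: date) -> list[dict]:
    """Placeholder for the corrected transformation (assumed deterministic)."""
    return [{"day": day.isoformat(), "value": day.toordinal() % 7}]


def backfill_partition(day: date) -> None:
    rows = transform(day)
    with _lock:
        warehouse[day] = rows   # overwrite the partition, never append to it
        completed.add(day)      # record progress only after the write succeeds


def run_backfill(start: date, end: date, max_workers: int = 4) -> int:
    """Backfill [start, end], skipping partitions that already completed.

    Returns the number of partitions processed in this run.
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    pending = [d for d in days if d not in completed]
    # max_workers caps concurrency so live workloads are not starved.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(backfill_partition, pending))  # list() surfaces errors
    return len(pending)
```

Calling `run_backfill` a second time over the same range does no work, and if the progress log is lost, a full rerun still leaves exactly one copy of each partition because every write replaces the previous contents.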

Validate before exposing

Compare the backfilled output against expectations and the old data before promoting it. Only swap once the new history passes its quality gates.
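
A validation gate followed by a swap might look roughly like the sketch below. Everything here is assumed for illustration: the `catalog` dict stands in for whatever pointer the real system offers (a view definition, a table rename, a manifest), `storage` maps location names to rows, and the specific checks and the 10% drift tolerance are example quality gates, not a recommended set.

```python
def validate(new_rows, old_rows, key="id", max_row_drift=0.10):
    """Return a list of failed checks; an empty list means the gates passed."""
    failures = []
    if not new_rows:
        failures.append("empty output")
    keys = [row[key] for row in new_rows]
    if len(keys) != len(set(keys)):
        failures.append("duplicate keys")
    if old_rows:
        # Compare against the old data: a large change in row count is suspect.
        drift = abs(len(new_rows) - len(old_rows)) / len(old_rows)
        if drift > max_row_drift:
            failures.append(f"row count drift {drift:.0%}")
    return failures


def promote(catalog, storage, table, shadow_location):
    """Point `table` at the shadow output only if validation passes."""
    old_rows = storage[catalog[table]]
    new_rows = storage[shadow_location]
    failures = validate(new_rows, old_rows)
    if failures:
        return failures               # live pointer is left untouched
    catalog[table] = shadow_location  # single pointer update acts as the swap
    return []
```

Because consumers only ever read through `catalog[table]`, they see either the old location or the fully validated new one, never a mix of the two.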

Key idea

Safe backfills overwrite idempotent partitions with bounded parallelism, write to a validated shadow location, then atomically swap, so live consumers never see partial or duplicated data.

Check yourself


1. Why is idempotent partition overwrite the foundation of safe backfills?

2. What does the shadow then swap pattern achieve?

3. Why use bounded parallelism during a backfill?