System Design

Idempotent Data Pipelines

Designing jobs so that running them twice produces the same result as running once.

5 min read · core

Why reruns happen

Pipelines fail and get retried, schedulers fire twice, and engineers rerun yesterday's job to fix a bug. If a rerun double-counts rows or appends duplicates, the data is wrong and downstream consumers stop trusting it. An idempotent job produces the same final state no matter how many times it runs for the same input.

How to make a job idempotent

  • Overwrite by partition instead of appending. A job that owns a date partition should delete and rewrite that whole partition, ideally in one transaction, so a rerun replaces the data rather than adding to it.
  • Use deterministic keys. Derive a stable identifier for each output row so a second write upserts onto the same row instead of inserting a copy.
  • Make reads bounded. Each run should target a fixed, explicit window of input (for example, one calendar day), never "everything new since the last run", because that set changes from one run to the next.
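
The first two techniques can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not a production design: the table daily_totals, the functions row_key, run_daily_job, and upsert_total, and the sample events are all hypothetical names invented for this example.

```python
import hashlib
import sqlite3


def row_key(day: str, user_id: str) -> str:
    """Deterministic key: the same (day, user_id) always yields the same id."""
    return hashlib.sha256(f"{day}|{user_id}".encode()).hexdigest()[:16]


def run_daily_job(conn: sqlite3.Connection, day: str,
                  events: list[tuple[str, int]]) -> None:
    """Overwrite the partition for `day`: delete it, then rewrite it."""
    totals: dict[str, int] = {}
    for user_id, amount in events:
        totals[user_id] = totals.get(user_id, 0) + amount
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM daily_totals WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_totals (id, day, user_id, total) "
            "VALUES (?, ?, ?, ?)",
            [(row_key(day, u), day, u, t) for u, t in sorted(totals.items())],
        )


def upsert_total(conn: sqlite3.Connection, day: str,
                 user_id: str, total: int) -> None:
    """Alternative: write one row under its deterministic key (an upsert)."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO daily_totals (id, day, user_id, total) "
            "VALUES (?, ?, ?, ?)",
            (row_key(day, user_id), day, user_id, total),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_totals "
    "(id TEXT PRIMARY KEY, day TEXT, user_id TEXT, total INTEGER)"
)

events = [("alice", 5), ("bob", 3), ("alice", 2)]  # hypothetical input
run_daily_job(conn, "2024-01-01", events)
run_daily_job(conn, "2024-01-01", events)  # simulated retry of the same day

upsert_total(conn, "2024-01-02", "carol", 4)
upsert_total(conn, "2024-01-02", "carol", 4)  # second write hits the same row

rows = conn.execute(
    "SELECT user_id, total FROM daily_totals "
    "WHERE day = '2024-01-01' ORDER BY user_id"
).fetchall()
carol_rows = conn.execute(
    "SELECT COUNT(*) FROM daily_totals WHERE day = '2024-01-02'"
).fetchone()[0]
```

After both runs the 2024-01-01 partition still holds one row per user, and the repeated upsert leaves a single row for carol. An append-only version of the same job would have doubled both.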

The contract

The pipeline becomes a pure function from a partition of input to a partition of output: the result depends only on that input, not on earlier runs or on when the job happens to execute. Reruns are safe, backfills are safe, and retries are safe.
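
One way to picture this contract is a transform whose output depends only on the requested day. The sketch below is illustrative and assumes events are (timestamp, amount) pairs; window_for, transform, and the sample data are hypothetical names, not part of any particular framework.

```python
from datetime import date, datetime, timedelta


def window_for(day: date) -> tuple[datetime, datetime]:
    """Fixed half-open window [start, end) covering exactly one calendar day."""
    start = datetime(day.year, day.month, day.day)
    return start, start + timedelta(days=1)


def transform(events: list[tuple[datetime, int]], day: date) -> dict:
    """Pure function: reads only events inside the window for `day`."""
    start, end = window_for(day)
    selected = [(ts, amount) for ts, amount in events if start <= ts < end]
    return {
        "day": day.isoformat(),
        "count": len(selected),
        "total": sum(amount for _, amount in selected),
    }


events = [
    (datetime(2024, 1, 1, 9, 0), 10),
    (datetime(2024, 1, 1, 23, 59), 5),
    (datetime(2024, 1, 2, 0, 0), 7),  # belongs to the next day's partition
]

first = transform(events, date(2024, 1, 1))

# New data arrives for a different day before the rerun.
events.append((datetime(2024, 1, 3, 8, 0), 100))
second = transform(events, date(2024, 1, 1))  # rerun or backfill of the same day
```

Because the window is fixed, the rerun sees the same two events and returns the same partition. A job that read "everything since the last run" would have picked up the new event instead.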

Key idea

Idempotent pipelines read a fixed input window, overwrite whole partitions, and write rows under deterministic keys, so that retries, double schedules, and backfills never duplicate or corrupt data.

Check yourself

1. What does an idempotent pipeline guarantee?

2. Which technique helps make a job idempotent?

3. Why should each run target a fixed input window?