Idempotent Data Pipelines

Designing jobs so that running them twice produces the same result as running once.

Why reruns happen

Pipelines fail and get retried, schedulers fire twice, and engineers rerun yesterday's job to fix a bug. If a rerun double-counts rows or appends duplicates, the output is wrong and downstream consumers stop trusting it. An idempotent job produces the same final state no matter how many times it runs for the same input.

How to make a job idempotent

Overwrite by partition instead of appending. A job that owns a date partition should delete and rewrite that partition, so rerunning replaces rather than adds.
Use deterministic keys. Derive a stable identifier for each output row so a second write upserts onto the same row instead of inserting a copy.
Make reads bounded. Each run should read a fixed window of input, such as one day's events. A query for "everything new since the last run" returns different rows each time it executes, so a rerun cannot reproduce the original output.
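The three rules above can be sketched together in a small example. This is a minimal illustration, not a reference implementation: it assumes a SQLite database, and the table names (`events`, `daily_totals`), the column names, and the function `run_daily_job` are all hypothetical. The job reads a fixed one-day window, deletes its own partition, and rewrites it inside a single transaction, so a second run replaces the rows instead of adding to them.

```python
import sqlite3
from datetime import date, timedelta


def run_daily_job(conn, day):
    """Rebuild the output partition for one day from a bounded input window."""
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    # "with conn" wraps both statements in one transaction: the delete and
    # the rewrite commit together, or roll back together on error.
    with conn:
        # Overwrite by partition: clear only the partition this run owns.
        conn.execute("DELETE FROM daily_totals WHERE day = ?", (start,))
        # Bounded read: a half-open window [start, end), never "since last run".
        conn.execute(
            """
            INSERT INTO daily_totals (day, user_id, total)
            SELECT ?, user_id, SUM(amount)
            FROM events
            WHERE ts >= ? AND ts < ?
            GROUP BY user_id
            """,
            (start, start, end),
        )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts TEXT, user_id TEXT, amount INTEGER)")
    # Deterministic key: (day, user_id) identifies each output row.
    conn.execute(
        "CREATE TABLE daily_totals ("
        "day TEXT, user_id TEXT, total INTEGER, PRIMARY KEY (day, user_id))"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("2024-01-01T09:00:00", "a", 5),
            ("2024-01-01T17:30:00", "a", 7),
            ("2024-01-01T12:00:00", "b", 3),
            ("2024-01-02T08:00:00", "a", 100),  # outside the window
        ],
    )
    conn.commit()

    day = date(2024, 1, 1)
    run_daily_job(conn, day)
    run_daily_job(conn, day)  # simulated retry or double schedule
    return conn.execute(
        "SELECT day, user_id, total FROM daily_totals ORDER BY user_id"
    ).fetchall()
```

Calling `demo()` runs the job twice for the same day and returns one row per user for that day. An append-only version of the same job would return twice as many rows after the second run.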

The contract

Under these rules the pipeline behaves like a pure function from a partition of input to a partition of output: the same input partition always yields the same output partition. Reruns, backfills, and retries are therefore all safe.
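One way to picture this contract is to separate a side-effect-free transform from a write that replaces the whole partition. The sketch below is illustrative only: an in-memory dict stands in for the output store, and `row_key`, `transform`, and `write_partition` are hypothetical names. The key is a hash of the row's natural identity, so it is the same on every run.

```python
import hashlib


def row_key(day, user_id):
    """Deterministic key: derived only from the row's identity, never from
    a run timestamp, a random UUID, or an auto-increment counter."""
    raw = f"{day}|{user_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def transform(day, input_rows):
    """Pure function: the same input partition yields the same output rows."""
    totals = {}
    for user_id, amount in input_rows:
        totals[user_id] = totals.get(user_id, 0) + amount
    return {row_key(day, u): (day, u, t) for u, t in totals.items()}


def write_partition(store, day, rows):
    """Replace the partition wholesale instead of appending to it."""
    store[day] = dict(rows)


store = {}
input_rows = [("a", 5), ("b", 3), ("a", 7)]

# A first run, a retry, and a backfill of the same day all converge
# on the same final state.
for _ in range(3):
    write_partition(store, "2024-01-01", transform("2024-01-01", input_rows))
```

After three runs, `store["2024-01-01"]` still holds exactly two rows, one per user, each under the same key it had after the first run.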

Key idea

Idempotent pipelines read a fixed input window, overwrite the partition they own, and write rows under deterministic keys, so that retries, double schedules, and backfills never duplicate or corrupt data.
