Why reruns happen
Pipelines fail and get retried, schedulers fire twice, and engineers rerun yesterday's job to fix a bug. If a rerun double-counts rows or appends duplicates, the data becomes wrong and hard to trust. An idempotent job produces the same final state no matter how many times it runs for the same input.
How to make a job idempotent
- Overwrite by partition instead of appending. A job that owns a date partition should delete and rewrite that partition, so rerunning replaces rather than adds.
- Use deterministic keys. Derive a stable identifier for each output row so a second write upserts onto the same row instead of inserting a copy.
- Make reads bounded. Each run should target a fixed window of input, such as one day of events, never "everything new since last time", because that set changes between runs.
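The three practices above can be sketched together with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not a production implementation: the `events` and `daily_counts` tables, their columns, and the helpers `row_key` and `run_for_day` are all hypothetical names chosen for the example.

```python
import hashlib
import sqlite3
from datetime import date, datetime, timedelta


def row_key(user_id: str, day: date) -> str:
    """Deterministic key: the same (user, day) always maps to the same id."""
    return hashlib.sha256(f"{user_id}|{day.isoformat()}".encode()).hexdigest()


def run_for_day(conn: sqlite3.Connection, day: date) -> None:
    """Recompute the daily_counts partition for one day (hypothetical schema)."""
    # Bounded read: a fixed half-open window [start, end), not "new since last run".
    start = datetime.combine(day, datetime.min.time())
    end = start + timedelta(days=1)
    with conn:  # one transaction: the delete and the rewrite commit together
        # Overwrite by partition: clear the partition this run owns.
        conn.execute("DELETE FROM daily_counts WHERE day = ?", (day.isoformat(),))
        rows = conn.execute(
            "SELECT user_id, COUNT(*) FROM events "
            "WHERE ts >= ? AND ts < ? GROUP BY user_id",
            (start.isoformat(), end.isoformat()),
        ).fetchall()
        # Keyed write: a repeated key replaces the row instead of adding a copy.
        conn.executemany(
            "INSERT OR REPLACE INTO daily_counts (key, day, user_id, n) "
            "VALUES (?, ?, ?, ?)",
            [(row_key(u, day), day.isoformat(), u, n) for u, n in rows],
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (user_id TEXT, ts TEXT);
    CREATE TABLE daily_counts (key TEXT PRIMARY KEY, day TEXT, user_id TEXT, n INTEGER);
    """
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("a", "2024-01-01T09:00:00"),
        ("a", "2024-01-01T17:30:00"),
        ("b", "2024-01-01T12:00:00"),
        ("a", "2024-01-02T08:00:00"),  # outside the window for 2024-01-01
    ],
)
conn.commit()

run_for_day(conn, date(2024, 1, 1))
run_for_day(conn, date(2024, 1, 1))  # rerun: replaces the partition, adds nothing
```

In this sketch the partition delete and the keyed write overlap on purpose: either one alone would stop duplicates here, and using both keeps the job safe if one of them is later changed.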
The contract
The pipeline becomes a pure function from a partition of input to a partition of output. Because the output depends only on that input partition, reruns, backfills, and retries are all safe.
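One way to picture this contract is a transform that depends only on its input partition, paired with a write that assigns the output partition rather than accumulating into it. The sketch below uses an in-memory dict as a stand-in for the output store; `Row`, `transform`, and `run` are hypothetical names for illustration.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (user_id, amount): a hypothetical row shape


def transform(input_rows: List[Row]) -> Dict[str, int]:
    """Pure: the result depends only on the input partition, not on clocks or prior state."""
    totals: Dict[str, int] = {}
    for user_id, amount in input_rows:
        totals[user_id] = totals.get(user_id, 0) + amount
    return totals


def run(store: Dict[str, Dict[str, int]], partition: str, input_rows: List[Row]) -> None:
    """Assign, never accumulate: the partition's output is replaced wholesale."""
    store[partition] = transform(input_rows)


store: Dict[str, Dict[str, int]] = {}
rows = [("a", 5), ("b", 2), ("a", 1)]

run(store, "2024-01-01", rows)
first = dict(store)
run(store, "2024-01-01", rows)  # a retry or backfill of the same partition
```

After the second call `store` equals `first`. Had `run` added the new totals onto the existing ones, the rerun would have doubled every value.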
Key idea
Idempotent pipelines overwrite fixed partitions with deterministic keys so that retries, double schedules, and backfills never duplicate or corrupt data.