System Design

Idempotent Pipeline Design

Designing jobs so reruns produce the same result without duplicates.

5 min read · advanced

Why idempotency

Pipelines fail and get retried. An idempotent job leaves the target in the same final state whether it runs once or many times on the same input. Without this property, a retry can double-count rows or corrupt totals.
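To make the failure mode concrete, here is a minimal sketch of a non-idempotent load. The in-memory `target` list and the `append_load` helper are hypothetical stand-ins for a real table and loader, not part of any library.

```python
# Hypothetical in-memory target table; a real job would write to a warehouse.
target: list = []


def append_load(rows: list) -> None:
    """Non-idempotent load: every call appends, even for rows already loaded."""
    target.extend(rows)


batch = [{"order_id": "A1", "amount_cents": 1250}]

append_load(batch)
append_load(batch)  # a retry of the same batch, e.g. after a timeout

# The same order is now stored twice, so the total is doubled.
total = sum(row["amount_cents"] for row in target)
```

After the retry, `target` holds two copies of order `A1` and `total` is twice the true amount.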

Patterns that achieve it

  • Overwrite by partition: instead of appending, fully replace the target partition for the processed window. Rerunning simply replaces the same partition again.
  • Upsert with a key: merge on a stable business key so a repeated record updates the existing row rather than inserting a duplicate.
  • Deterministic transforms: avoid relying on wall-clock time or random values inside the logic, since they make the output differ between runs.
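The three patterns above can be sketched together in a few lines. This is an illustration under stated assumptions, not a reference implementation: the dictionaries stand in for real tables, and `transform`, `load_partition`, `upsert`, `run_job`, and the `order_id` key are all hypothetical names chosen for the example.

```python
from __future__ import annotations

from datetime import date

# Hypothetical stand-ins for real storage.
partitioned_table: dict[date, list[dict]] = {}  # partition (date) -> rows
keyed_table: dict[str, dict] = {}               # business key -> row


def transform(raw: dict, run_date: date) -> dict:
    """Deterministic transform: the date comes from the window, not the clock."""
    return {
        "order_id": raw["order_id"],
        "amount_cents": round(raw["amount"] * 100),
        "processed_for": run_date.isoformat(),
    }


def load_partition(run_date: date, raw_rows: list[dict]) -> None:
    """Overwrite by partition: replace the whole partition for run_date."""
    partitioned_table[run_date] = [transform(r, run_date) for r in raw_rows]


def upsert(rows: list[dict]) -> None:
    """Upsert with a key: a repeated order_id overwrites the existing row."""
    for row in rows:
        keyed_table[row["order_id"]] = row


def run_job(run_date: date, raw_rows: list[dict]) -> None:
    load_partition(run_date, raw_rows)
    upsert(partitioned_table[run_date])


def total_cents() -> int:
    """Sum amounts across all partitions."""
    return sum(
        row["amount_cents"]
        for rows in partitioned_table.values()
        for row in rows
    )


window = date(2024, 1, 1)
raw = [{"order_id": "A1", "amount": 12.5}, {"order_id": "B2", "amount": 3.0}]

run_job(window, raw)
run_job(window, raw)  # simulated retry of the same window
```

Because `run_job` takes the window as a parameter and replaces rather than appends, the second call leaves both tables exactly as the first call did.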

Deduplication

When inputs may contain duplicate events, deduplicate on a unique event id, often keeping the copy with the latest timestamp. This makes at-least-once delivery behave, in effect, like exactly-once processing.
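A minimal sketch of this kind of deduplication, assuming each event is a dictionary with hypothetical `event_id` and `ts` fields (neither name comes from a particular system):

```python
from __future__ import annotations


def dedupe_latest(events: list[dict]) -> list[dict]:
    """Keep one event per event_id, preferring the latest timestamp.

    If two copies share the same timestamp, the first one seen is kept.
    """
    latest: dict[str, dict] = {}
    for event in events:
        current = latest.get(event["event_id"])
        if current is None or event["ts"] > current["ts"]:
            latest[event["event_id"]] = event
    # Sort by id so the output order does not depend on arrival order.
    return sorted(latest.values(), key=lambda e: e["event_id"])


events = [
    {"event_id": "e1", "ts": 1, "value": 10},
    {"event_id": "e2", "ts": 1, "value": 5},
    {"event_id": "e1", "ts": 2, "value": 11},  # later version of e1
    {"event_id": "e2", "ts": 1, "value": 5},   # exact redelivery of e2
]

unique_events = dedupe_latest(events)
```

Running `dedupe_latest` on its own output returns the same list, so the step itself is safe to repeat.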

Why it pays off

Idempotency makes backfills, retries, and recovery safe. You can rerun any window and get the same result, so there is no need to clean up partial output by hand before retrying.

Key idea

Make jobs idempotent through partition overwrite, keyed upserts, deterministic transforms, and deduplication so reruns never double-count or corrupt data.

Check yourself

1. What does an idempotent pipeline guarantee?

2. Which pattern replaces appending so that loading data becomes idempotent?

3. Why avoid wall-clock time inside transform logic?