Why idempotency
Pipelines fail and get retried. An idempotent job produces the same final state whether it runs once or many times on the same input. Without this, a retry can double-count rows or corrupt totals.
Patterns that achieve it
- Overwrite by partition: instead of appending, fully replace the target partition for the processed window. Rerunning simply replaces the same partition again.
- Upsert with a key: merge on a stable business key so a repeated record updates rather than duplicates.
- Deterministic transforms: avoid relying on wall-clock time or random values inside the logic, since they make the output differ between runs.
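The three patterns above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the in-memory dictionaries stand in for real tables, and every name here (`transform`, `overwrite_partition`, `upsert`, `daily_sales`, `orders`) is hypothetical.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for real tables (assumptions for illustration).
PartitionedTable = Dict[str, List[dict]]   # partition key -> rows
KeyedTable = Dict[str, dict]               # business key -> row


def transform(rows: List[dict], window: str) -> List[dict]:
    """Deterministic: the window is an argument, not read from the clock."""
    return [{**row, "window": window} for row in rows]


def overwrite_partition(table: PartitionedTable, window: str, rows: List[dict]) -> None:
    """Replace the whole partition for the window instead of appending."""
    table[window] = list(rows)


def upsert(table: KeyedTable, rows: List[dict], key: str) -> None:
    """Merge on a stable business key: update if present, insert if not."""
    for row in rows:
        table[row[key]] = {**table.get(row[key], {}), **row}


batch = [{"order_id": "A1", "amount": 10}, {"order_id": "A2", "amount": 5}]
daily_sales: PartitionedTable = {}
orders: KeyedTable = {}

# Simulate the same window being processed three times (one run plus two retries).
for _ in range(3):
    rows = transform(batch, "2024-01-01")
    overwrite_partition(daily_sales, "2024-01-01", rows)
    upsert(orders, rows, key="order_id")

total = sum(row["amount"] for row in daily_sales["2024-01-01"])
```

After three runs the partition still holds two rows and `total` is 15. An append-based version of the same loop would hold six rows and report 45.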
Deduplication
When inputs may contain duplicate events, deduplicate on a unique event id, often keeping the latest record by timestamp. This makes at-least-once delivery behave like exactly-once processing for downstream consumers.
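A minimal sketch of keep-latest deduplication follows. The field names `event_id` and `ts` are assumptions for illustration; on a timestamp tie this version keeps the record seen first, which is one possible tie-breaking choice among several.

```python
from typing import Dict, Iterable, List


def deduplicate(events: Iterable[dict]) -> List[dict]:
    """Keep one record per event_id, preferring the latest timestamp."""
    latest: Dict[str, dict] = {}
    for event in events:
        current = latest.get(event["event_id"])
        # Strict comparison: on a tie, the first record seen is kept.
        if current is None or event["ts"] > current["ts"]:
            latest[event["event_id"]] = event
    # Sort by id so the output order does not depend on arrival order.
    return sorted(latest.values(), key=lambda e: e["event_id"])


events = [
    {"event_id": "e1", "ts": 1, "value": "old"},
    {"event_id": "e2", "ts": 2, "value": "only"},
    {"event_id": "e1", "ts": 3, "value": "new"},   # later version of e1
    {"event_id": "e2", "ts": 2, "value": "only"},  # redelivered duplicate
]
unique = deduplicate(events)
```

Here `unique` contains one record for `e1` (the `ts` 3 version) and one for `e2`. Feeding the same events in twice gives the same result, so redelivery does not change the output.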
Why it pays off
Idempotency makes backfills, retries, and recovery safe. You can rerun any window without first checking what a previous attempt already wrote, which simplifies operations.
Key idea
Make jobs idempotent through partition overwrite, keyed upserts, deterministic transforms, and deduplication so reruns never double-count or corrupt data.