Checkpointing And Savepoints

Periodic automatic snapshots for fault recovery versus deliberate snapshots for upgrades and migrations.

Why snapshots exist

A long-running stream job holds large state. If a node crashes, the job must restore that state and resume without losing records or double-counting them. The mechanism is a consistent snapshot of all operator state together with the source positions that state corresponds to.

Checkpoints

A checkpoint is an automatic, periodic snapshot the engine takes for fault tolerance. It is optimized for low overhead and may be stored in an internal format. On failure the job rolls back to the last completed checkpoint and replays from the recorded source offsets. Old checkpoints are pruned automatically.
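The rollback-and-replay behavior can be illustrated with a toy model. The sketch below is not any particular engine's API: `CountingJob`, `Checkpoint`, `interval`, `retained`, and `crash_at` are hypothetical names chosen for illustration, and the source is assumed to be a replayable in-memory list. It shows that restoring state and source offset together gives the same result as a run with no failure.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    offset: int   # next source position to read
    counts: dict  # copy of the operator state at that position


@dataclass
class CountingJob:
    """Toy job (hypothetical): counts words from a replayable source."""
    source: list
    interval: int = 3   # take a checkpoint every `interval` records
    retained: int = 2   # keep only the newest `retained` checkpoints
    offset: int = 0
    counts: dict = field(default_factory=dict)
    checkpoints: list = field(default_factory=list)

    def step(self):
        word = self.source[self.offset]
        self.counts[word] = self.counts.get(word, 0) + 1
        self.offset += 1
        if self.offset % self.interval == 0:
            # Snapshot state and source position together so they stay consistent.
            self.checkpoints.append(Checkpoint(self.offset, dict(self.counts)))
            # Prune older checkpoints automatically.
            del self.checkpoints[:-self.retained]

    def recover(self):
        """Roll back to the last completed checkpoint, or to the start."""
        if self.checkpoints:
            last = self.checkpoints[-1]
            self.offset, self.counts = last.offset, dict(last.counts)
        else:
            self.offset, self.counts = 0, {}

    def run(self, crash_at=None):
        while self.offset < len(self.source):
            if crash_at is not None and self.offset == crash_at:
                crash_at = None  # simulate a single crash
                self.recover()   # progress since the checkpoint is discarded
                continue
            self.step()
        return self.counts


words = ["a", "b", "a", "c", "b", "a", "a", "c"]
clean = CountingJob(list(words)).run()
crashed = CountingJob(list(words)).run(crash_at=5)
print(clean == crashed)  # True
```

Because the counts and the offset are captured in the same snapshot, the records replayed after the crash are applied to state that has not yet seen them, so nothing is lost and nothing is counted twice.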

Savepoints

A savepoint is a manually triggered snapshot meant for operational changes, such as upgrading job code, rescaling parallelism, or migrating clusters. It uses a stable, portable format and is retained until you delete it.
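To make the "stable, portable format" point concrete, here is a minimal sketch of a savepoint used to rescale a job. It rests on assumptions: the function names (`subtask_for`, `take_savepoint`, `restore_from_savepoint`), the JSON layout, and the `version` field are all hypothetical, and keyed state is assumed to be assigned to parallel subtasks by a stable hash of the key. Real engines use their own formats and key-assignment schemes.

```python
import json
import zlib


def subtask_for(key, parallelism):
    # Stable hash so key placement is reproducible across processes.
    return zlib.crc32(key.encode("utf-8")) % parallelism


def take_savepoint(subtasks, offset):
    """Merge per-subtask keyed state into one portable JSON document."""
    merged = {}
    for state in subtasks:
        merged.update(state)
    return json.dumps(
        {"version": 1, "offset": offset, "state": merged}, sort_keys=True
    )


def restore_from_savepoint(savepoint, parallelism):
    """Redistribute keyed state across a possibly different number of subtasks."""
    doc = json.loads(savepoint)
    subtasks = [{} for _ in range(parallelism)]
    for key, value in doc["state"].items():
        subtasks[subtask_for(key, parallelism)][key] = value
    return subtasks, doc["offset"]


# Job running with parallelism 2 is stopped with a savepoint...
old_subtasks = [{} for _ in range(2)]
for key, value in {"a": 4, "b": 2, "c": 2}.items():
    old_subtasks[subtask_for(key, 2)][key] = value
savepoint = take_savepoint(old_subtasks, offset=8)

# ...and restarted with parallelism 3 from the same savepoint.
new_subtasks, resume_offset = restore_from_savepoint(savepoint, parallelism=3)
```

The savepoint stores state by key rather than by subtask, which is what lets the restored job use a different parallelism or a new code version that reads the same layout.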

The key distinction

Both capture consistent state, but checkpoints serve automatic recovery while savepoints serve planned, human-driven lifecycle operations.

Key idea

Checkpoints are automatic periodic snapshots for crash recovery, while savepoints are deliberate portable snapshots for upgrades, rescaling, and migrations.
