← Lessons

quiz vs the machine

Platinum1780

System Design

The Watermark and Late Data

How a watermark estimates progress so a window can close despite delayed events.

6 min read · advanced · beat Platinum to climb

Events arrive out of order

In real systems an event created at one moment may reach the processor much later, after a network delay or a mobile device reconnects. So a window keyed on event time cannot simply close when the clock passes its end.

What a watermark is

A watermark is the system's claim that no more events with a timestamp below a certain point are expected. When the watermark passes a window's end, the engine assumes the window is complete and emits its result.

The accuracy latency tradeoff

  • A conservative watermark waits longer, catching more late events but delaying results.
  • An aggressive watermark emits sooner with lower latency but risks dropping records that arrive after it.

Handling late data

Events that arrive after the watermark are late. Options include:

  • Allowed lateness, keeping window state a bit longer to update emitted results.
  • Side output, routing late records to a separate stream for repair.
  • Drop, accepting the loss for simplicity.

Key idea

A watermark estimates how far event time has progressed so windows can close, trading latency against completeness, while late data is handled with allowed lateness, side outputs, or drops.

Check yourself

Answer to earn rating on the learn ladder.

1. What does a watermark assert?

2. What is the tradeoff of an aggressive watermark?

3. How can a system handle an event that arrives after the watermark?