Events arrive out of order
In real systems an event created at one moment may reach the processor much later, for example after a network delay or once a mobile device reconnects. So a window keyed on event time cannot simply close when the processor's own clock passes the window's end.
What a watermark is
A watermark is the system's estimate of how far event time has progressed: a watermark at time T is a claim that no more events with a timestamp below T are expected. When the watermark passes a window's end, the engine assumes the window is complete and emits its result.
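To make the mechanism concrete, here is a minimal sketch of one common heuristic: the watermark trails the largest timestamp seen so far by a fixed bound. The names `WINDOW_SIZE`, `MAX_DELAY`, and `run` are illustrative assumptions, not any particular engine's API, and events that arrive for an already closed window are simply skipped here.

```python
from collections import defaultdict

WINDOW_SIZE = 10   # tumbling window length in seconds (assumed)
MAX_DELAY = 3      # assumed bound on how far out of order events arrive


def run(events):
    """Process (event_time, value) pairs in arrival order.

    Returns (window_start, total) pairs in the order the windows close.
    """
    open_windows = defaultdict(int)   # window_start -> running sum
    max_seen = float("-inf")          # largest event time observed so far
    watermark = float("-inf")
    emitted = []

    for event_time, value in events:
        start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        if start + WINDOW_SIZE <= watermark:
            continue  # window already closed; late-data handling is out of scope here
        open_windows[start] += value

        # Heuristic watermark: largest timestamp seen, minus the assumed delay.
        max_seen = max(max_seen, event_time)
        watermark = max_seen - MAX_DELAY

        # Close every open window whose end is at or below the watermark.
        for s in sorted(open_windows):
            if s + WINDOW_SIZE <= watermark:
                emitted.append((s, open_windows.pop(s)))

    return emitted
```

With this sketch, the window covering [0, 10) closes only once an event with timestamp 13 or later arrives, so an event stamped 9 that shows up after one stamped 12 is still counted.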
The accuracy-latency tradeoff
- A conservative watermark lags further behind the newest events, catching more late events but delaying results.
- An aggressive watermark follows the newest events closely, emitting results sooner but risking the loss of records that arrive after it.
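The tradeoff can be illustrated with a small sketch that counts how many events would be late for a given watermark lag. It assumes the watermark is the largest timestamp seen so far minus `max_delay`; the function name `count_late` and that heuristic are assumptions made for illustration.

```python
def count_late(event_times, max_delay):
    """Count events whose timestamp is already below the watermark on arrival.

    event_times is a list of event timestamps in arrival order.
    The watermark is assumed to be (largest timestamp seen) - max_delay.
    """
    max_seen = float("-inf")
    late = 0
    for t in event_times:
        if t < max_seen - max_delay:   # watermark has already passed this timestamp
            late += 1
        max_seen = max(max_seen, t)
    return late


if __name__ == "__main__":
    arrivals = [1, 4, 2, 12, 9, 13, 6, 15]
    for delay in (0, 2, 5, 10):
        print(delay, count_late(arrivals, delay))
```

A larger `max_delay` never increases the late count, but every window then closes `max_delay` later in event time, which is the latency side of the tradeoff.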
Handling late data
Events that arrive after the watermark has already passed their timestamp are late. Options include:
- Allowed lateness, keeping window state for a bounded period after the result is emitted so that late events can update it.
- Side output, routing late records to a separate stream so they can be repaired or reconciled later.
- Drop, accepting the loss for simplicity.
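The three options can be compared in one sketch. It treats them as mutually exclusive policies for simplicity, although real engines often combine them, and it reuses the assumption that the watermark trails the largest timestamp seen by a fixed bound. All names (`process`, `ALLOWED_LATENESS`, and so on) are hypothetical.

```python
from collections import defaultdict

WINDOW_SIZE = 10       # tumbling window length in seconds (assumed)
MAX_DELAY = 3          # watermark lag behind the largest timestamp seen (assumed)
ALLOWED_LATENESS = 5   # extra event time that window state is kept after firing (assumed)


def process(events, policy):
    """Run (event_time, value) pairs, given in arrival order, under one policy.

    policy is "allowed_lateness", "side_output", or "drop".
    Returns (emissions, side_output); emissions lists every (window_start, total)
    emitted, including corrected results.
    """
    keep = ALLOWED_LATENESS if policy == "allowed_lateness" else 0
    state = defaultdict(int)   # window_start -> running sum while state is retained
    fired = set()              # windows that have emitted at least one result
    emissions, side_output = [], []
    max_seen = watermark = float("-inf")

    for event_time, value in events:
        start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        end = start + WINDOW_SIZE

        if end <= watermark:                        # late: the watermark passed this window
            if policy == "side_output":
                side_output.append((event_time, value))
            elif watermark < end + keep:            # state still retained: emit a correction
                state[start] += value
                fired.add(start)
                emissions.append((start, state[start]))
            # otherwise the record is dropped
        else:
            state[start] += value

        max_seen = max(max_seen, event_time)
        watermark = max_seen - MAX_DELAY

        for s in sorted(state):
            if s + WINDOW_SIZE <= watermark and s not in fired:
                fired.add(s)
                emissions.append((s, state[s]))     # first result for this window
            if s + WINDOW_SIZE + keep <= watermark:
                del state[s]                        # purge once the lateness period ends

    return emissions, side_output
```

Under `"allowed_lateness"` a window may appear more than once in `emissions`, so downstream consumers need to treat later entries as replacements. Under `"side_output"` the main results are never revised and the late records are returned separately.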
Key idea
A watermark estimates how far event time has progressed so windows can close, trading latency against completeness, while late data is handled with allowed lateness, side outputs, or drops.