The problem of closing a window
In event time, events can arrive out of order or late. So when can a window safely close and emit its result? Waiting forever blocks output, but closing too early drops valid stragglers.
What a watermark is
A watermark is a marker that flows with the stream and asserts that no more events with a timestamp earlier than the watermark are expected. When the watermark passes the end of a window, the engine fires that window. Watermarks let the system make progress on event time despite disorder.
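To make the mechanism concrete, here is a minimal sketch of one common heuristic: the watermark trails the largest timestamp seen so far by a fixed bound. The names WINDOW_SIZE, MAX_DELAY, window_start, and run are illustrative assumptions, not the API of any particular engine, and the tumbling windows and summing aggregation are chosen only to keep the example short.

```python
WINDOW_SIZE = 10  # assumed tumbling-window length, in event-time units
MAX_DELAY = 3     # assumed bound on how far out of order events may arrive


def window_start(ts):
    """Start of the tumbling window [start, start + WINDOW_SIZE) containing ts."""
    return ts - ts % WINDOW_SIZE


def run(events):
    """Sum values per window; events is an iterable of (timestamp, value)
    in arrival order. Returns (window_start, sum) pairs in firing order."""
    open_windows = {}            # window_start -> running sum
    fired = []
    max_ts = float("-inf")
    watermark = float("-inf")
    for ts, value in events:
        start = window_start(ts)
        if start + WINDOW_SIZE <= watermark:
            continue             # window already fired; this sketch drops the event
        open_windows[start] = open_windows.get(start, 0) + value
        # Advance the watermark: it trails the largest timestamp by MAX_DELAY.
        max_ts = max(max_ts, ts)
        watermark = max_ts - MAX_DELAY
        # Fire every open window whose end the watermark has reached.
        for s in sorted(open_windows):
            if s + WINDOW_SIZE <= watermark:
                fired.append((s, open_windows.pop(s)))
    return fired


events = [(1, 1), (4, 1), (2, 1), (12, 1), (9, 1), (13, 1), (3, 1), (25, 1)]
print(run(events))
```

In this run the event with timestamp 9 arrives out of order but before the watermark reaches 10, so it is still counted in window [0, 10). The event with timestamp 3 arrives after that window has fired and is ignored.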
Handling late data
Some events still arrive after the watermark has passed the end of their window. Engines typically offer three choices:
- Drop late events for simplicity.
- Keep the window state for an allowed-lateness period so late events can still update the result.
- Route late events to a side output for separate handling.
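The three policies can be sketched in one small function. This is an illustration under stated assumptions rather than how any specific engine implements them: process, ALLOWED_LATENESS, and the "on_time"/"update" labels are hypothetical names, and the watermark is again assumed to trail the largest timestamp seen by MAX_DELAY.

```python
WINDOW_SIZE = 10       # assumed tumbling-window length
MAX_DELAY = 3          # assumed out-of-orderness bound for the watermark
ALLOWED_LATENESS = 5   # assumed extra time window state is retained


def process(events, allowed_lateness=ALLOWED_LATENESS):
    """events: (timestamp, value) pairs in arrival order.
    Returns (emitted, side_output)."""
    state = {}            # window_start -> running sum, kept until cleanup
    fired = set()         # windows that have already emitted a first result
    emitted = []          # (window_start, sum, "on_time" | "update")
    side_output = []      # events that arrived after their window's state was purged
    max_ts = float("-inf")
    watermark = float("-inf")
    for ts, value in events:
        start = ts - ts % WINDOW_SIZE
        end = start + WINDOW_SIZE
        if end + allowed_lateness <= watermark:
            # Too late even for allowed lateness: route to the side output.
            side_output.append((ts, value))
        else:
            state[start] = state.get(start, 0) + value
            if start in fired:
                # Late but within allowed lateness: emit an updated result.
                emitted.append((start, state[start], "update"))
        max_ts = max(max_ts, ts)
        watermark = max_ts - MAX_DELAY
        for s in sorted(state):
            e = s + WINDOW_SIZE
            if e <= watermark and s not in fired:
                fired.add(s)
                emitted.append((s, state[s], "on_time"))
            if e + allowed_lateness <= watermark:
                del state[s]   # lateness period over: purge window state
    return emitted, side_output


emitted, side_output = process([(1, 1), (12, 1), (13, 1), (3, 1), (25, 1), (4, 1)])
print(emitted)
print(side_output)
```

Here the event with timestamp 3 arrives after window [0, 10) has fired but within the lateness period, so it triggers an updated result. The event with timestamp 4 arrives after the state has been purged and goes to the side output. Calling process with allowed_lateness=0 and ignoring the side output corresponds to simply dropping late events.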
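The trade-off
A conservative watermark lags further behind the newest events, so fewer events count as late and results are more complete, but every window fires later. An aggressive watermark fires sooner, but more events arrive after their window has fired and fall to the lateness policy.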
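A rough way to see this effect is to replay the same arrival sequence with different watermark lags and count how many events arrive after their window has already been passed. The function count_late and the sample timestamps below are made up for illustration, and the watermark is assumed to trail the largest timestamp seen by max_delay.

```python
def count_late(timestamps, max_delay, window_size=10):
    """Count events whose tumbling window ended at or before the watermark
    that was current when the event arrived."""
    max_ts = float("-inf")
    late = 0
    for ts in timestamps:
        watermark = max_ts - max_delay
        window_end = ts - ts % window_size + window_size
        if window_end <= watermark:
            late += 1
        max_ts = max(max_ts, ts)
    return late


arrivals = [1, 12, 3, 14, 9, 25, 11]   # event timestamps in arrival order
for delay in (0, 5, 10):
    print(delay, count_late(arrivals, delay))
```

With these arrivals, a lag of 0 marks three events late, a lag of 5 marks one, and a lag of 10 marks none. The cost is on the latency side: under this heuristic, window [0, 10) cannot fire until an event with timestamp at least 10 + max_delay has been seen.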
Key idea
Watermarks estimate event time progress so windows can close on disordered streams, while a lateness policy decides how to treat events that still arrive too late.