Event time versus processing time
Streams care about two clocks. Event time is when something actually happened, and processing time is when the system saw it. Network delays, mobile devices offline, and retries mean events often arrive out of order and late, long after their event time has passed.
What a watermark does
A watermark is the processor saying I believe I have seen all events up to time T. It lets the system close time windows and emit results. The watermark trails behind the newest event time by an allowed lateness gap that you tune.
- Set the gap too small and you emit results before late events arrive, producing wrong counts.
- Set the gap too large and results are correct but delayed, holding state and memory longer.
Handling stragglers
Events later than the watermark are either dropped, sent to a side output, or used to update an already emitted window if your engine supports it.
Key idea
A watermark estimates that all events up to a time have arrived, trading a tunable lateness gap between fast results and correct results for out of order streams.