Requirements
- Record every ad click and aggregate counts per ad over time windows.
- Provide near-real-time dashboards and accurate billing totals.
- Handle massive click volume and avoid double counting.
High-level design
A streaming pipeline ingests clicks and aggregates them into time windows.
- Ingestion: click events land in a partitioned log such as Kafka, partitioned by ad id.
- Stream processing: a stream job aggregates counts per ad per time window using event time and watermarks.
- Storage: write rolled-up counts to an analytics store for dashboards and a durable store for billing.
- Dedup: attach a click id and deduplicate to avoid counting retries twice.
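The stream-processing and dedup steps above can be sketched in a few lines. This is a minimal single-process illustration, not a real stream job: the names `Click`, `WindowedCounter`, `window_size`, and `allowed_lateness` are hypothetical, the windows are assumed to be tumbling, and the watermark is assumed to trail the largest event time seen by a fixed lateness budget.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    click_id: str    # unique per click; a client retry reuses the same id
    ad_id: str
    event_time: int  # seconds, stamped when the click happened


class WindowedCounter:
    """Counts clicks per (ad_id, window_start) using event time (hypothetical helper)."""

    def __init__(self, window_size=60, allowed_lateness=30):
        self.window_size = window_size            # assumed tumbling window length
        self.allowed_lateness = allowed_lateness  # assumed lateness budget
        self.max_event_time = 0
        self.seen_ids = set()       # unbounded here; a real job would expire old ids
        self.open_counts = defaultdict(int)
        self.emitted = {}           # finalized windows, ready for storage
        self.late_dropped = 0

    def watermark(self):
        # Heuristic watermark: trail the largest event time seen so far.
        return self.max_event_time - self.allowed_lateness

    def process(self, click):
        if click.click_id in self.seen_ids:
            return  # duplicate delivery or client retry
        self.seen_ids.add(click.click_id)

        window_start = click.event_time - click.event_time % self.window_size
        if window_start + self.window_size <= self.watermark():
            self.late_dropped += 1  # the window was already finalized
            return

        self.open_counts[(click.ad_id, window_start)] += 1
        self.max_event_time = max(self.max_event_time, click.event_time)
        self._close_finished_windows()

    def _close_finished_windows(self):
        # A window is complete once the watermark has passed its end.
        wm = self.watermark()
        finished = [k for k in self.open_counts if k[1] + self.window_size <= wm]
        for key in finished:
            self.emitted[key] = self.open_counts.pop(key)
```

In this sketch a click that arrives after its window has closed is simply counted in `late_dropped`; a billing pipeline would more likely route such clicks to a correction path instead of discarding them.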
Bottlenecks
- Late events: watermarks decide when a window is complete while tolerating some lateness; events that arrive after the watermark has passed must be dropped or applied as a correction.
- Exactly-once: idempotent writes keyed by click id prevent double counting under retries.
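- Hot ads: partitioning by ad id alone sends every click for a popular ad to one partition, so split hot keys across sub-partitions and pre-aggregate there before merging the partial counts.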
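One common way to spread a hot ad is to salt its key so that clicks fan out across several sub-keys, pre-aggregate per sub-key, and then merge the partial counts. The sketch below assumes the set of hot ads is already known; `NUM_SALTS`, `salted_key`, `pre_aggregate`, and `merge` are hypothetical names chosen for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 8  # assumed fan-out for each hot ad


def salted_key(ad_id, hot_ads, rng=random):
    # Hot ads get a random salt so their clicks spread over NUM_SALTS sub-keys;
    # all other ads keep a single sub-key (salt 0).
    salt = rng.randrange(NUM_SALTS) if ad_id in hot_ads else 0
    return (ad_id, salt)


def pre_aggregate(ad_ids, hot_ads, rng=random):
    # Stage 1: count per (ad_id, salt). Each sub-key could live on its own partition.
    partial = Counter()
    for ad_id in ad_ids:
        partial[salted_key(ad_id, hot_ads, rng)] += 1
    return partial


def merge(partial):
    # Stage 2: drop the salt and sum the partial counts per ad.
    total = Counter()
    for (ad_id, _salt), count in partial.items():
        total[ad_id] += count
    return total
```

The second stage handles at most `NUM_SALTS` partial counts per hot ad per window, so it stays cheap even when the first stage sees heavy skew.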
Tradeoffs
- Larger windows reduce overhead but increase result latency.
- Strict exactly-once processing costs more coordination than at-least-once delivery, which is cheaper but only approximate unless the writes are idempotent.
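The cheaper end of this tradeoff can still produce exact billing totals if the sink is idempotent. As a rough sketch, the store below uses an in-memory SQLite table with `click_id` as the primary key, so replaying a batch after a retry leaves the totals unchanged. The table name `billed_clicks` and the helper functions are assumptions made for illustration, not part of the design above.

```python
import sqlite3


def open_store():
    # In-memory database for the example; a real billing store would be durable.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE billed_clicks (click_id TEXT PRIMARY KEY, ad_id TEXT NOT NULL)"
    )
    return conn


def write_batch(conn, batch):
    # INSERT OR IGNORE skips rows whose click_id already exists,
    # so writing the same batch twice has no extra effect.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO billed_clicks (click_id, ad_id) VALUES (?, ?)",
            batch,
        )


def billing_totals(conn):
    rows = conn.execute("SELECT ad_id, COUNT(*) FROM billed_clicks GROUP BY ad_id")
    return dict(rows.fetchall())
```

A plain counter that is incremented per delivered click would double its totals on the same replay, which is the overcounting that at-least-once delivery alone permits.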
Key idea
An ad click aggregator is a partitioned streaming pipeline that windows events by event time, deduplicates by click id, and rolls up counts for dashboards and billing.