Requirements
- Deliver messages in near real time and store them durably.
- Support presence, delivery receipts, and group chats.
- Work when a recipient is offline and reconnects later.
High-level design
Clients hold a persistent connection to a gateway. Messages flow through a service that persists them first and then routes them.
- Connection layer: long-lived WebSocket connections terminate at stateless gateways; a session registry tracks which gateway holds each user's connection.
- Message service: writes each message to durable storage, then looks up where the recipient is connected.
- Routing: if the recipient is online, push the message through their gateway. If offline, the message stays in storage and is delivered on reconnect.
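The persist-then-route path above can be sketched roughly as follows. This is a minimal in-memory illustration, not a definitive implementation: `Message`, `Gateway`, and `MessageService` are hypothetical names, a Python list stands in for durable storage, and a dict stands in for the session registry.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    recipient: str
    body: str


@dataclass
class Gateway:
    """Stand-in for a WebSocket gateway; it only records what it would push."""
    name: str
    pushed: list = field(default_factory=list)

    def push(self, user: str, message: Message) -> None:
        self.pushed.append((user, message))


class MessageService:
    def __init__(self) -> None:
        self.store: list[Message] = []          # assumption: stands in for durable storage
        self.registry: dict[str, Gateway] = {}  # session registry: user -> gateway

    def connect(self, user: str, gateway: Gateway) -> None:
        self.registry[user] = gateway

    def disconnect(self, user: str) -> None:
        self.registry.pop(user, None)

    def send(self, message: Message) -> bool:
        """Persist first, then route. Returns True if the message was pushed live."""
        self.store.append(message)
        gateway = self.registry.get(message.recipient)
        if gateway is None:
            return False  # recipient offline: the message waits in the store
        gateway.push(message.recipient, message)
        return True
```

Writing to the store before the registry lookup means a message is never lost if the recipient disconnects between the two steps; the worst case is that it is delivered later from storage.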
Bottlenecks
- Connection state: with millions of open sockets, routing needs a registry that maps each user to their gateway so a message reaches the right node.
- Ordering: per-conversation sequence numbers keep messages in order, even when a send is retried.
- Group fan-out: a message to a large group expands to many recipients, so fan out asynchronously through a queue.
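The ordering and fan-out points above might look like the sketch below. It makes assumptions the notes do not state: each message carries a client-generated `client_msg_id` so that a retry maps back to the sequence number it was first given, and `queue.Queue` from the standard library stands in for a real message queue. `Sequencer`, `enqueue_fan_out`, and `drain` are hypothetical names.

```python
import itertools
import queue
from collections import defaultdict


class Sequencer:
    """Assigns per-conversation sequence numbers; a retry reuses its original number."""

    def __init__(self) -> None:
        self._counters = defaultdict(lambda: itertools.count(1))
        self._assigned: dict[tuple[str, str], int] = {}

    def assign(self, conversation_id: str, client_msg_id: str) -> int:
        key = (conversation_id, client_msg_id)
        if key not in self._assigned:
            self._assigned[key] = next(self._counters[conversation_id])
        return self._assigned[key]


def enqueue_fan_out(fan_out_queue, members, sender, conversation_id, seq):
    """Put one delivery task per recipient on the queue and return immediately."""
    for member in members:
        if member != sender:
            fan_out_queue.put((member, conversation_id, seq))


def drain(fan_out_queue, deliver):
    """Worker loop; in a real system this runs apart from the send path."""
    while True:
        try:
            task = fan_out_queue.get_nowait()
        except queue.Empty:
            return
        deliver(*task)
```

The sender's request only pays for the enqueue step, so send latency stays roughly flat as groups grow; delivery cost moves to the workers.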
On reconnect, an offline user catches up by reading the durable store, starting just after their last acknowledged sequence number.
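As a rough sketch of that catch-up read, assume each conversation is an append-only log ordered by sequence number. `ConversationLog` and `catch_up` are hypothetical names, and the in-memory lists stand in for the durable store.

```python
from bisect import bisect_right


class ConversationLog:
    """Append-only log of (sequence number, body) pairs for one conversation."""

    def __init__(self) -> None:
        self._seqs: list[int] = []
        self._bodies: list[str] = []

    def append(self, seq: int, body: str) -> None:
        # Assumption: sequence numbers are appended in strictly increasing order.
        if self._seqs and seq <= self._seqs[-1]:
            raise ValueError("sequence numbers must increase")
        self._seqs.append(seq)
        self._bodies.append(body)

    def read_after(self, last_acked: int) -> list[tuple[int, str]]:
        """Return every message with a sequence number greater than last_acked."""
        start = bisect_right(self._seqs, last_acked)
        return list(zip(self._seqs[start:], self._bodies[start:]))


def catch_up(log: ConversationLog, last_acked: int):
    """Return the missed messages and the highest sequence number among them."""
    pending = log.read_after(last_acked)
    highest = pending[-1][0] if pending else last_acked
    return pending, highest
```

The client would acknowledge `highest` only after it has processed the batch, so a crash mid-delivery causes a repeat read instead of a gap.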
Key idea
A chat system is a persistent connection plus a durable log: a session registry routes live messages, and stored sequence numbers handle offline catch-up.