You cannot fix what you cannot see
A complete design includes how you would observe the system in production. Mentioning monitoring and alerting signals operational maturity: it shows that you think about running the system, not just building it.
The pillars of observability
- Metrics are numeric measurements tracked over time, such as latency and error rate.
- Logs are detailed records of individual events.
- Traces follow one request across many services.
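To make the three pillars concrete, here is a minimal sketch of one request producing all three signals. It uses only the Python standard library. The in-memory stores and the names `record_metric`, `log_event`, `traced_call`, and `handle_request` are hypothetical stand-ins for a real metrics backend, log pipeline, and tracing system, not any particular library's API.

```python
import json
import time
import uuid
from collections import defaultdict

# Hypothetical in-memory stores standing in for real backends.
metrics = defaultdict(list)  # metric name -> list of (timestamp, value)
logs = []                    # structured log lines (JSON strings)
spans = []                   # one entry per service call within a trace


def record_metric(name, value):
    """Metric: a number recorded against a timestamp."""
    metrics[name].append((time.time(), value))


def log_event(trace_id, message, **fields):
    """Log: a detailed record of one event, tagged with the trace id."""
    logs.append(json.dumps({"trace_id": trace_id, "message": message, **fields}))


def traced_call(trace_id, service, func):
    """Trace: time one hop of the request and record it as a span."""
    start = time.perf_counter()
    try:
        return func()
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append(
            {"trace_id": trace_id, "service": service, "duration_ms": duration_ms}
        )


def handle_request(user_id):
    """Handle one request and emit a metric, a log line, and two spans."""
    trace_id = uuid.uuid4().hex  # shared id that ties the signals together
    start = time.perf_counter()
    # The sleeps are placeholders for calls to downstream services.
    traced_call(trace_id, "auth", lambda: time.sleep(0.001))
    traced_call(trace_id, "orders-db", lambda: time.sleep(0.002))
    latency_ms = (time.perf_counter() - start) * 1000
    record_metric("request_latency_ms", latency_ms)
    log_event(trace_id, "request handled", user_id=user_id,
              latency_ms=round(latency_ms, 2))
    return trace_id
```

The shared `trace_id` is the design choice worth mentioning: it lets you jump from a slow point on a latency graph to the log line and the spans for that exact request.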
What to watch
- The four golden signals: latency, traffic, errors, and saturation.
- Business metrics like sign-ups or orders per minute.
- Dependency health for databases and downstream services.
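As a rough illustration of the golden signals, the sketch below derives all four from one window of request records. The `Request` type, the `golden_signals` function, and the use of in-flight requests divided by capacity as the saturation measure are assumptions made for this example; real systems often measure saturation as CPU, memory, or queue depth instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        return 0.0
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one time window as the four golden signals."""
    latencies = sorted(r.latency_ms for r in requests)
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": percentile(latencies, 99),       # latency
        "traffic_rps": total / window_seconds,             # traffic
        "error_rate": failed / total if total else 0.0,    # errors
        "saturation": in_flight / capacity,                # saturation
    }
```

Reporting a high percentile such as p99 rather than the average is deliberate: a small number of very slow requests can hide behind a healthy mean.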
Alert with care
Alerts should fire on symptoms users feel, such as a rising error rate or latency, not on every internal blip. Too many alerts cause fatigue, and fatigued responders miss real pages. State that you would alert on the golden signals and tie thresholds to your latency and availability goals.
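One way to express this is a rule that pages only when a user-facing symptom stays above its goal for several windows in a row. The sketch below is an illustration under assumptions: the `Goals` type, the `breaches` and `should_page` functions, and the default of three consecutive windows are all hypothetical choices, and the thresholds come straight from the stated goals rather than from any standard.

```python
from dataclasses import dataclass


@dataclass
class Goals:
    availability: float    # e.g. 0.999 allows at most 0.1% failed requests
    latency_p99_ms: float  # e.g. 300 means p99 should stay under 300 ms


def breaches(window, goals):
    """Return which user-facing symptoms exceed the goals in one window."""
    reasons = []
    if window["error_rate"] > 1 - goals.availability:
        reasons.append("errors")
    if window["latency_p99_ms"] > goals.latency_p99_ms:
        reasons.append("latency")
    return reasons


def should_page(windows, goals, sustained=3):
    """Page only if the most recent `sustained` windows all breach a goal.

    Requiring consecutive breaches filters out one-off blips, which is
    what keeps the alert volume low enough to avoid fatigue.
    """
    recent = windows[-sustained:]
    return len(recent) == sustained and all(breaches(w, goals) for w in recent)
```

For example, with `Goals(availability=0.999, latency_p99_ms=300)`, a single window with a 2% error rate does not page, while the same error rate held across the three most recent windows does.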
Key idea
Round out a design with metrics, logs, and traces, and alert on user-facing symptoms tied to your latency and availability goals.