Why single-host cron breaks
Traditional cron runs on one machine. If that host dies, every scheduled job silently stops. Running the same crontab on many hosts instead makes each job fire once per host. Neither is acceptable at scale.
The reliable pattern
- A single leader, elected via a lock service, owns the schedule, so only one scheduler instance fires each job.
- Each job run is recorded under a key of job name and fire time, so recording the same run twice is a no-op.
- A worker claims a run record before executing, so a duplicate trigger finds it already claimed.
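The claim step above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it assumes the run records live in a SQL table (SQLite here, purely for convenience), and the names `open_store`, `try_claim`, the `runs` table, and the worker IDs are all hypothetical. The composite primary key on job name and fire time is what makes a duplicate trigger lose the claim.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open a run store (hypothetical schema) keyed by job name and fire time."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " job_name TEXT NOT NULL,"
        " fire_time TEXT NOT NULL,"
        " claimed_by TEXT NOT NULL,"
        " PRIMARY KEY (job_name, fire_time))"
    )
    return conn


def try_claim(conn, job_name, fire_time, worker_id):
    """Return True if this worker claimed the run, False if it was already claimed."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            # The primary key rejects a second row for the same run;
            # OR IGNORE turns that conflict into a no-op instead of an error.
            "INSERT OR IGNORE INTO runs (job_name, fire_time, claimed_by)"
            " VALUES (?, ?, ?)",
            (job_name, fire_time.isoformat(), worker_id),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    store = open_store()
    fire = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
    print(try_claim(store, "nightly-report", fire, "worker-a"))
    print(try_claim(store, "nightly-report", fire, "worker-b"))
```

A worker would call `try_claim` first and execute the job only when it returns True. In a real deployment the store would be a shared database rather than an in-memory one, and a claim alone does not cover a worker that crashes after claiming, which needs a separate retry or timeout policy.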
Handling missed windows
If the scheduler was down at a fire time, on recovery it must decide whether to backfill the missed run or skip it. The choice depends on the job: a report can backfill, a notification probably should not.
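One way to make that decision explicit is a per-job policy consulted on recovery. The sketch below assumes a fixed-interval schedule and a known last successful fire time; `MissedPolicy` and `missed_runs_to_backfill` are hypothetical names introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class MissedPolicy(Enum):
    BACKFILL = "backfill"  # run every window that was missed
    SKIP = "skip"          # drop missed windows, resume at the next fire time


def missed_runs_to_backfill(last_fire, now, interval, policy):
    """Return the missed fire times that should still run after recovery at `now`.

    Assumes a fixed interval between fire times (an illustration-only
    simplification; real cron expressions are not always evenly spaced).
    """
    missed = []
    t = last_fire + interval
    while t < now:  # fire times strictly before recovery were missed
        missed.append(t)
        t += interval
    if policy is MissedPolicy.BACKFILL:
        return missed
    return []  # SKIP: nothing to replay


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    now = datetime(2024, 1, 1, 3, 30, tzinfo=timezone.utc)
    hour = timedelta(hours=1)
    print(missed_runs_to_backfill(last, now, hour, MissedPolicy.BACKFILL))
    print(missed_runs_to_backfill(last, now, hour, MissedPolicy.SKIP))
```

Under these assumptions a report job would be configured with `BACKFILL` and a notification job with `SKIP`. Because each backfilled run keeps its original fire time, it goes through the same run-record key as a run fired on schedule.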
Key idea
Cron at scale elects one scheduler and keys runs by job and fire time so each scheduled job executes exactly once, with an explicit policy for missed windows.