The Cron Scheduler at Scale

Recurring Work

Nightly reports, hourly syncs, and per-minute health rollups all repeat on a schedule. A single machine running the Unix cron daemon is simple but fragile: if that machine is down at the trigger time, the run is silently missed.

Make It Highly Available

Run the scheduler on several nodes for redundancy. The risk is that two nodes both fire the same schedule, causing a duplicate run. Coordinate them in one of two ways:

Leader election: the nodes elect a leader, and only the elected node fires schedules.
Distributed lock: every node tries to take a lock keyed by schedule name and fire time, and only the first node to acquire it runs the schedule.
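
The lock approach can be sketched in a few lines. This is a minimal illustration, not a production design: InMemoryLockStore is a hypothetical stand-in for whatever shared store the nodes would really use (anything offering an atomic set-if-absent operation), and the names fire_if_first, node-a, and nightly-report are made up for the example.

```python
import threading


class InMemoryLockStore:
    """Hypothetical stand-in for a shared store with atomic set-if-absent."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}
        self._mutex = threading.Lock()

    def try_acquire(self, key: str, owner: str) -> bool:
        """Record owner for key only if nobody holds it; True means we won."""
        with self._mutex:
            if key in self._owners:
                return False
            self._owners[key] = owner
            return True


def fire_if_first(store: InMemoryLockStore, node: str,
                  schedule: str, fire_time: str) -> bool:
    """Return True only for the first node to claim this schedule and fire time."""
    key = f"{schedule}@{fire_time}"
    return store.try_acquire(key, node)


store = InMemoryLockStore()
# Two nodes race for the same occurrence; only the first claim succeeds.
results = [
    fire_if_first(store, node, "nightly-report", "2024-01-01T00:00")
    for node in ("node-a", "node-b")
]
# The next night's occurrence has a different key, so it can be claimed again.
next_night = fire_if_first(store, "node-b", "nightly-report", "2024-01-02T00:00")
```

In a real deployment the claim would also need an expiry or cleanup policy so the store does not grow without bound; that is left out here.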

Exactly One Firing

Key the lock or dedup record on the schedule plus the scheduled minute, not on the wall-clock time at which a node happens to wake. Then even if two nodes wake at slightly different moments, they compete for the same slot and only one wins.
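
As a rough sketch of deriving such a key, the code below assumes a fixed-interval schedule whose slots are aligned to the UTC epoch, so a wake-up time can be rounded down to its slot boundary. The helper names scheduled_slot and dedup_key, and the schedule name hourly-sync, are illustrative only; a real cron expression would need a proper parser to find the intended fire time.

```python
from datetime import datetime, timezone


def scheduled_slot(woke_at: datetime, interval_seconds: int) -> datetime:
    """Round a wake-up time down to the most recent slot boundary (UTC)."""
    epoch = int(woke_at.timestamp())
    return datetime.fromtimestamp(epoch - epoch % interval_seconds, tz=timezone.utc)


def dedup_key(schedule: str, woke_at: datetime, interval_seconds: int) -> str:
    """Build a key from the schedule and its slot, not from the wake-up time."""
    slot = scheduled_slot(woke_at, interval_seconds)
    return f"{schedule}:{slot.isoformat(timespec='minutes')}"


# Two nodes wake a few seconds apart for the same hourly occurrence.
woke_a = datetime(2024, 1, 1, 3, 0, 0, 200000, tzinfo=timezone.utc)
woke_b = datetime(2024, 1, 1, 3, 0, 4, tzinfo=timezone.utc)
key_a = dedup_key("hourly-sync", woke_a, 3600)
key_b = dedup_key("hourly-sync", woke_b, 3600)
```

Both nodes produce the same key, so whichever one claims it first wins and the other backs off.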

Catching Up

Decide what happens after downtime. Should a missed nightly run catch up when the scheduler returns, or skip to the next slot? For idempotent reports, catch up. For time-sensitive alerts, skipping is safer.
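
One way to make that decision explicit is a per-schedule policy. The sketch below is an assumption-laden illustration: the MissedPolicy enum, the slots_to_fire function, and the grace window (how late a slot may be and still count as on time) are all invented for the example and are not part of any particular scheduler.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class MissedPolicy(Enum):
    CATCH_UP = "catch_up"  # run every slot missed during downtime
    SKIP = "skip"          # drop stale slots and wait for the next one


def slots_to_fire(last_fired: datetime, now: datetime, interval: timedelta,
                  policy: MissedPolicy, grace: timedelta) -> list[datetime]:
    """List the slots to fire now, given the last slot that actually fired."""
    due = []
    slot = last_fired + interval
    while slot <= now:
        due.append(slot)
        slot += interval
    if policy is MissedPolicy.CATCH_UP:
        return due
    # SKIP: keep only slots that are still within the grace window.
    return [s for s in due if now - s <= grace]


utc = timezone.utc
last = datetime(2024, 1, 1, tzinfo=utc)
back_online = datetime(2024, 1, 3, 9, tzinfo=utc)  # scheduler was down for two nights
day, minute = timedelta(days=1), timedelta(minutes=1)

caught_up = slots_to_fire(last, back_online, day, MissedPolicy.CATCH_UP, minute)
skipped = slots_to_fire(last, back_online, day, MissedPolicy.SKIP, minute)
```

Here caught_up holds the two missed nightly slots, while skipped is empty and the schedule simply resumes at the next slot.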

Separate Trigger From Work

The scheduler should only enqueue a job, not run it. Workers do the heavy lifting, so a slow job never delays the next firing.
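
The split can be illustrated with an in-process queue. This is only a sketch: Python's queue.Queue stands in for a real message queue shared between machines, and scheduler_tick, worker_drain, and the handlers mapping are hypothetical names chosen for the example.

```python
import queue
from typing import Callable


def scheduler_tick(jobs: queue.Queue, schedule: str, slot: str) -> None:
    """Scheduler side: record that the slot is due and return immediately."""
    jobs.put({"schedule": schedule, "slot": slot})


def worker_drain(jobs: queue.Queue,
                 handlers: dict[str, Callable[[str], str]]) -> list[str]:
    """Worker side: pull queued jobs and run the (possibly slow) handler."""
    results = []
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return results
        results.append(handlers[job["schedule"]](job["slot"]))


jobs: queue.Queue = queue.Queue()
handlers = {"nightly-report": lambda slot: f"report for {slot}"}

scheduler_tick(jobs, "nightly-report", "2024-01-01T00:00")
queued = jobs.qsize()  # the scheduler has enqueued work but run nothing
results = worker_drain(jobs, handlers)
```

Because scheduler_tick does nothing but enqueue, its cost stays constant no matter how long the handler takes.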

Key idea

A scalable cron scheduler runs redundantly, uses a lock keyed on the scheduled time so that each occurrence fires exactly once, and only enqueues work, leaving execution to the workers.