Requirements
- Schedule one-off and recurring jobs to run at a target time.
- Distribute work across many workers reliably.
- Retry failures and avoid running a job twice.
High-level design
A scheduler tracks due jobs, hands them to a queue, and workers pull and execute them with status tracking.
- Job store: persists job definitions, schedules, and next run times.
- Dispatcher: scans for due jobs and pushes them onto a work queue.
- Workers: pull jobs, execute, and report results, with leases preventing double pickup.
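
The dispatcher step above can be sketched as follows. This is a minimal in-memory illustration, not a definitive implementation: `Job`, `JobStore`, `pop_due`, and `dispatch` are hypothetical names, a heap stands in for a persistent store indexed by next run time, and a `deque` stands in for the work queue.

```python
import heapq
from collections import deque
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    # Jobs compare by next_run only, so the heap keeps the earliest job on top.
    next_run: float
    job_id: str = field(compare=False)


class JobStore:
    """Hypothetical in-memory stand-in for a persistent job store."""

    def __init__(self):
        self._heap = []

    def add(self, job):
        heapq.heappush(self._heap, job)

    def pop_due(self, now):
        # Remove and return every job whose next_run is at or before `now`,
        # earliest first.
        due = []
        while self._heap and self._heap[0].next_run <= now:
            due.append(heapq.heappop(self._heap))
        return due


def dispatch(store, queue, now):
    """One dispatcher tick: move every due job onto the work queue."""
    for job in store.pop_due(now):
        queue.append(job)


store = JobStore()
store.add(Job(next_run=10.0, job_id="a"))
store.add(Job(next_run=5.0, job_id="b"))
store.add(Job(next_run=20.0, job_id="c"))

queue = deque()
dispatch(store, queue, now=12.0)
queued_ids = [job.job_id for job in queue]
```

With these inputs, jobs "b" and "a" are due at time 12.0 and are queued in that order, while "c" stays in the store. In a real deployment the store would be a database table with an index on the next-run column, and the dispatcher would call `dispatch` on a short polling interval.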
Bottlenecks
- Duplicate runs: two workers picking up the same job would execute it twice, so a lease with a timeout gives one worker exclusive ownership.
- Time accuracy: jobs should fire close to their target time, so the dispatcher polls at a short interval and looks up due jobs through an index on next run time.
- Failure recovery: a crashed worker must not strand a job, so an expired lease lets another worker reclaim it.
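
The lease behavior described in the list above can be sketched like this. It is a single-process illustration under stated assumptions: `LeaseTable`, `try_acquire`, and `release` are hypothetical names, time is passed in explicitly so the example is deterministic, and a real system would need the acquire step to be atomic in the shared store (for example, a conditional update).

```python
class LeaseTable:
    """Hypothetical lease registry: one holder per job until the lease expires."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self._leases = {}  # job_id -> (worker_id, expires_at)

    def try_acquire(self, job_id, worker_id, now):
        holder = self._leases.get(job_id)
        # An unexpired lease blocks every other pickup attempt.
        if holder is not None and holder[1] > now:
            return False
        # No lease, or the lease expired: this worker takes ownership.
        self._leases[job_id] = (worker_id, now + self.lease_seconds)
        return True

    def release(self, job_id, worker_id):
        # Only the current holder may release, so a slow worker whose lease
        # was reclaimed cannot clear the new owner's lease.
        holder = self._leases.get(job_id)
        if holder is not None and holder[0] == worker_id:
            del self._leases[job_id]
            return True
        return False


leases = LeaseTable(lease_seconds=30)
first = leases.try_acquire("job-1", "w1", now=0)       # w1 takes the job
second = leases.try_acquire("job-1", "w2", now=10)     # lease still held by w1
reclaimed = leases.try_acquire("job-1", "w2", now=31)  # w1's lease has expired
stale_release = leases.release("job-1", "w1")          # w1 no longer owns it
```

Here `first` and `reclaimed` are `True`, while `second` and `stale_release` are `False`: the lease blocks a concurrent pickup, and its expiry lets another worker reclaim the job after a crash. Note that an expired lease does not stop the original worker if it is merely slow, so job handlers should still be safe to run more than once.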
Recurring jobs compute their next run time after each execution so the schedule continues without gaps.
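
One way to compute the next run time for a fixed-interval job is sketched below. The policy is an assumption, not something the design above prescribes: the next run is anchored to the scheduled time rather than the completion time, so slow executions do not shift the schedule, and slots that were missed entirely are skipped. `next_run_time` and its parameters are hypothetical names, and times are integer seconds.

```python
def next_run_time(scheduled, interval, now):
    """Return the first slot after `now` on the grid scheduled + k * interval.

    `scheduled` is the target time of the run that just finished.
    Always advances at least one interval.
    """
    steps = max(1, (now - scheduled) // interval + 1)
    return scheduled + steps * interval


on_time = next_run_time(scheduled=100, interval=60, now=105)
overran = next_run_time(scheduled=100, interval=60, now=250)
```

In this example `on_time` is 160 (the very next slot) and `overran` is 280, because the slots at 160 and 220 had already passed by the time the run finished. A scheduler that must execute every slot would instead return `scheduled + interval` unconditionally and let the dispatcher catch up.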
Key idea
A job scheduler persists due times and uses leased queue pickups so jobs run near their target time, retry on failure, and avoid duplicate execution.