System Design

Design a Distributed Job Scheduler

Run jobs at the right time across workers, with retries and exactly-once intent.

7 min read · advanced

Requirements

  • Schedule one-off and recurring jobs to run at a target time.
  • Distribute work across many workers reliably.
  • Retry failures and avoid running a job twice.

High-level design

A scheduler tracks due jobs and hands them to a queue; workers pull them, execute them, and report their status.

  • Job store: persists job definitions, schedules, and next run times.
  • Dispatcher: scans for due jobs and pushes them onto a work queue.
  • Workers: pull jobs, execute, and report results, with leases preventing double pickup.
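
The dispatcher's scan can be sketched as follows. This is a minimal in-memory illustration, not a production design: the `Job` and `Dispatcher` names are hypothetical, a heap stands in for a job store indexed by next run time, and a `deque` stands in for a durable work queue.

```python
import heapq
from collections import deque
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    next_run: float                       # epoch seconds when the job is due
    job_id: str = field(compare=False)    # excluded from ordering comparisons


class Dispatcher:
    """Moves due jobs from a time-ordered store onto a work queue."""

    def __init__(self):
        self.store = []        # min-heap by next_run (assumed stand-in for an indexed table)
        self.queue = deque()   # assumed stand-in for a durable work queue

    def schedule(self, job_id, next_run):
        heapq.heappush(self.store, Job(next_run, job_id))

    def tick(self, now):
        """Push every job with next_run <= now; return how many were dispatched."""
        dispatched = 0
        while self.store and self.store[0].next_run <= now:
            job = heapq.heappop(self.store)
            self.queue.append(job.job_id)
            dispatched += 1
        return dispatched
```

Because the store is ordered by next run time, each tick only touches jobs that are actually due, which is what makes frequent polling cheap.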

Bottlenecks

  • Duplicate runs: if two workers pick up the same job, it executes twice, so a lease with a timeout gives one worker exclusive ownership until it finishes or the lease expires.
  • Time accuracy: jobs should fire near their target, so the dispatcher polls frequently and indexes by next run time.
  • Failure recovery: a crashed worker must not strand a job, so an expired lease lets another worker reclaim it.
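
The lease behavior described above might look like the sketch below. It is an assumption-laden illustration: `LeaseTable` and its methods are hypothetical names, and the in-memory dictionary stands in for what would need to be an atomic conditional update in a shared job store.

```python
class LeaseTable:
    """In-memory stand-in for atomic lease updates in a shared job store."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.leases = {}    # job_id -> (worker_id, expires_at)
        self.done = set()   # job_ids that have completed successfully

    def try_acquire(self, job_id, worker_id, now):
        """Grant the lease only if the job is unfinished and no live lease exists."""
        if job_id in self.done:
            return False
        holder = self.leases.get(job_id)
        if holder is not None and holder[1] > now:
            return False    # another worker still owns the job
        # No lease, or the previous one expired: this worker takes over.
        self.leases[job_id] = (worker_id, now + self.lease_seconds)
        return True

    def complete(self, job_id, worker_id, now):
        """Accept a result only from the current, unexpired lease holder."""
        holder = self.leases.get(job_id)
        if holder is None or holder[0] != worker_id or holder[1] <= now:
            return False
        del self.leases[job_id]
        self.done.add(job_id)
        return True
```

A worker whose lease has expired may still be running the job when another worker reclaims it, so rejecting its late result in `complete` limits the damage but does not undo side effects. This is why the guarantee is exactly-once intent rather than strict exactly-once execution.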

Recurring jobs compute their next run time after each execution so the schedule continues without gaps.
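
One way to compute the next run time for a fixed-interval job is sketched below. The function name and parameters are hypothetical, and the policy is an assumption: slots are anchored to the original schedule rather than to the completion time, so a slow run does not make later runs drift, and slots missed during an outage are skipped rather than backfilled.

```python
def next_run_time(scheduled_at, interval, now):
    """Return the first slot strictly after `now` on the grid scheduled_at + k * interval."""
    if now < scheduled_at:
        return scheduled_at
    # Count whole intervals already elapsed, then move to the following slot.
    elapsed_slots = (now - scheduled_at) // interval + 1
    return scheduled_at + elapsed_slots * interval
```

For example, a job scheduled at t=100 with a 60-second interval that finishes at t=105 gets 160 as its next run time, not 165.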

Key idea

A job scheduler persists due times and uses leased queue pickups so jobs run near their target time, retry on failure, and avoid duplicate execution.

Check yourself

1. How does a lease prevent two workers running the same job?

2. How does a recurring job keep its schedule going?