The Batch Processing Model

Processing large bounded datasets in scheduled jobs that read everything, compute, and write results.

What batch processing is

Batch processing runs computation over a bounded dataset that is fully available before the job starts. A scheduler starts the job, which reads the whole input, transforms it, and writes the output. Classic examples are nightly billing runs, daily report rollups, and ETL pipelines.
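The read, transform, write shape described above can be sketched in a few lines of Python. This is a minimal illustration, not a production job: the function name run_batch_job, the column names customer and amount, and the in-memory files are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def run_batch_job(input_file, output_file):
    """Read the whole bounded input, aggregate it, then write the results."""
    # Read: the input is complete before the job starts, so load all of it.
    rows = list(csv.DictReader(input_file))

    # Transform: sum the amount per customer (Decimal keeps totals exact).
    totals = defaultdict(Decimal)
    for row in rows:
        totals[row["customer"]] += Decimal(row["amount"])

    # Write: emit one output row per customer, in sorted order.
    writer = csv.writer(output_file, lineterminator="\n")
    writer.writerow(["customer", "total"])
    for customer in sorted(totals):
        writer.writerow([customer, str(totals[customer])])
    return len(rows)


# In-memory stand-ins for the input and output files (hypothetical data).
input_data = io.StringIO("customer,amount\nalice,10.00\nbob,5.50\nalice,2.25\n")
output_data = io.StringIO()
rows_read = run_batch_job(input_data, output_data)
```

In a real deployment a scheduler such as cron would invoke a function like this against actual files or tables. The sketch only shows that the job sees all of its input before it produces any output.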

Core properties

High throughput is the goal, not low latency. A job can run for minutes or hours.
Bounded input means the system knows where the data ends, so it can compute exact totals.
Reproducibility is easy because the input is fixed. Rerunning a deterministic job on the same data gives the same result.
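The last two properties can be checked directly. The sketch below assumes a hypothetical daily_rollup function and made-up events. It computes exact per-day totals over a fixed input and compares a fingerprint of the output across runs.

```python
import hashlib
import json


def daily_rollup(events):
    """Sum amounts per day over a bounded list of events."""
    totals = {}
    for event in events:
        day = event["day"]
        totals[day] = totals.get(day, 0) + event["amount_cents"]
    # Sorting the keys makes the serialized output identical across runs.
    return json.dumps(totals, sort_keys=True)


def fingerprint(text):
    """Hash the output so two runs can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical fixed input; amounts are integer cents so totals are exact.
events = [
    {"day": "2024-01-01", "amount_cents": 1250},
    {"day": "2024-01-02", "amount_cents": 300},
    {"day": "2024-01-01", "amount_cents": 500},
]

first_run = daily_rollup(events)
second_run = daily_rollup(events)
reordered_run = daily_rollup(list(reversed(events)))
same_output = fingerprint(first_run) == fingerprint(second_run)
```

Because the job is a pure function of its input, same_output is true, and for this particular aggregation even a reordered input gives the same totals. A job that reads the clock, uses random numbers, or calls external services would not have this guarantee.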

The trade-off

Batch jobs are simple and reliable, but they add latency between when an event happens and when its effect appears. A sale at noon may not show in a report until the next morning.
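The delay in the noon sale example can be worked out with a small calculation. This sketch assumes a job that runs once a day at 02:00, a schedule chosen only for illustration, and it ignores how long the job itself takes to finish.

```python
from datetime import datetime, time, timedelta


def next_run_after(event_time, run_at):
    """Return the first daily run that starts strictly after event_time."""
    candidate = datetime.combine(event_time.date(), run_at)
    if candidate <= event_time:
        # Today's run has already started, so the event waits for tomorrow's.
        candidate += timedelta(days=1)
    return candidate


# Hypothetical values: a sale at noon and a nightly run at 02:00.
sale = datetime(2024, 3, 4, 12, 0)
nightly_run = time(2, 0)

visible_at = next_run_after(sale, nightly_run)
delay = visible_at - sale
```

Here the sale waits 14 hours before the next run even begins. Under this schedule the wait can approach a full 24 hours for an event that arrives just after a run starts.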

Batch is ideal when freshness can wait and correctness over complete data matters most, as in finance and analytics.

Key idea

Batch processing trades freshness for simplicity and exactness by computing over a complete bounded dataset in scheduled runs.

Check yourself