What batch processing is
Batch processing runs computation over a bounded dataset that is fully available before the job starts. A scheduler kicks off the job, which reads the whole input, transforms it, and writes the output. Classic examples are nightly billing runs, daily report rollups, and ETL pipelines.
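The read, transform, write cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the function name run_batch_job, the column names customer and amount_cents, and the sample rows are all made up for the example, and in-memory buffers stand in for real files.

```python
import csv
import io


def run_batch_job(input_file, output_file):
    """Read the whole bounded input, aggregate it, then write the output."""
    # Read: the input is complete before the job starts, so load all of it.
    rows = list(csv.DictReader(input_file))

    # Transform: total the charges per customer (hypothetical column names).
    totals = {}
    for row in rows:
        customer = row["customer"]
        totals[customer] = totals.get(customer, 0) + int(row["amount_cents"])

    # Write: one row per customer, sorted so the output order is stable.
    writer = csv.writer(output_file, lineterminator="\n")
    writer.writerow(["customer", "total_cents"])
    for customer in sorted(totals):
        writer.writerow([customer, totals[customer]])
    return totals


# Illustrative input: one day's charges, fully available up front.
day_of_charges = io.StringIO(
    "customer,amount_cents\n"
    "alice,1200\n"
    "bob,500\n"
    "alice,300\n"
)
report = io.StringIO()
totals = run_batch_job(day_of_charges, report)
```

In a real deployment the scheduler (cron or a workflow engine, for instance) would invoke something like run_batch_job once per period with that period's input file.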
Core properties
- High throughput is the goal, not low latency. A job can run for minutes or hours.
- Bounded input means the system knows where the data ends, so it can compute exact totals.
- Reproducibility is easy because the input is fixed. Rerunning a deterministic job on the same data gives the same result.
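The exactness and reproducibility properties can be checked directly. The sketch below assumes a deterministic job (no randomness, clock reads, or external lookups); daily_rollup, fingerprint, and the sample events are hypothetical names and data chosen for illustration.

```python
import hashlib
import json


def daily_rollup(events):
    """Compute exact totals; possible because the input has a known end."""
    total = sum(event["amount_cents"] for event in events)
    return {"count": len(events), "total_cents": total}


def fingerprint(result):
    """Hash a result so two runs can be compared cheaply."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# A fixed, bounded input (made-up sample data).
events = [
    {"id": 1, "amount_cents": 1200},
    {"id": 2, "amount_cents": 500},
    {"id": 3, "amount_cents": 300},
]

first_run = daily_rollup(events)
second_run = daily_rollup(events)
same_output = fingerprint(first_run) == fingerprint(second_run)
```

Comparing fingerprints of two runs is one simple way to confirm that a rerun over the same input reproduced the earlier output.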
The trade-off
Batch jobs are simple and reliable, but they add latency between when an event happens and when its effect appears. A sale at noon may not show in a report until the next morning.
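The delay in the noon-sale example can be worked out with simple date arithmetic. This sketch assumes a single daily run at 02:00 and ignores how long the job itself takes; the schedule and the helper name next_run_after are assumptions for illustration only.

```python
from datetime import datetime, time, timedelta


def next_run_after(event_time, run_at=time(2, 0)):
    """Return the first scheduled run that starts after the event."""
    candidate = datetime.combine(event_time.date(), run_at)
    if candidate <= event_time:
        # Today's run has already started, so the event waits until tomorrow.
        candidate += timedelta(days=1)
    return candidate


sale = datetime(2024, 3, 1, 12, 0)   # a sale at noon
visible = next_run_after(sale)       # 2024-03-02 02:00
delay = visible - sale               # 14 hours before the job even starts
```

Under these assumptions the worst case is just under 24 hours, for an event that arrives right after a run begins.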
Batch is ideal when freshness can wait and correctness over complete data matters most, as in finance and analytics.
Key idea
Batch processing trades freshness for simplicity and exactness by computing over a complete bounded dataset in scheduled runs.