The Spark Execution Model

How Spark turns transformations into a DAG of stages and tasks executed lazily across executors.

Lazy transformations

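In Spark you describe a computation with transformations such as map and filter. Each transformation returns a new dataset and records a step in a lineage graph instead of computing anything. Nothing runs until an action such as count, collect, or a write to storage is called. This lazy evaluation lets the engine see, and optimize, the whole plan before executing it.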
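The idea of recording steps and deferring work until an action can be sketched in plain Python. This is a toy model, not Spark's API: the class LazyDataset, the calls list, and the function double are hypothetical names invented for the illustration.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset: records steps, runs them only on an action."""

    def __init__(self, data, lineage=()):
        self._data = list(data)
        self.lineage = tuple(lineage)  # recorded (kind, function) steps

    # Transformations: return a new dataset with a longer lineage, compute nothing.
    def map(self, fn):
        return LazyDataset(self._data, self.lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self.lineage + (("filter", pred),))

    def _run(self):
        # Replay the recorded lineage over each element, one element at a time.
        for item in self._data:
            keep = True
            for kind, fn in self.lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: these are the only methods that trigger execution.
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


calls = []  # records every time the mapped function actually executes


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # only lineage has been built so far
result = ds.collect()             # the action triggers execution
```

Before collect is called, double has not run at all; afterwards it has run once per input element. Real Spark goes further than this sketch and can rewrite the plan before running it.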

From DAG to stages

When an action fires, the driver compiles the lineage into a DAG and splits it into stages. A new stage begins wherever data must be shuffled across the network. Within a stage every dependency is narrow, meaning each parent partition feeds at most one child partition, so a task can compute its output partition without exchanging data with other tasks.
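Cutting a plan into stages at shuffle boundaries can be illustrated with a small sketch. It assumes a simplified linear lineage given as a list of operation names, whereas a real Spark DAG can branch (a join has two parents). The WIDE set and the function split_into_stages are hypothetical and are not how Spark's scheduler is implemented.

```python
# Operations treated as wide (shuffle-inducing) in this toy model.
WIDE = {"groupByKey", "reduceByKey", "join", "repartition"}


def split_into_stages(lineage):
    """Group a linear list of operation names into stages.

    A wide operation reads shuffled data, so it starts a new stage;
    narrow operations are appended to the current stage.
    """
    stages = [[]]
    for op in lineage:
        if op in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


lineage = ["map", "filter", "reduceByKey", "map", "join", "filter"]
stages = split_into_stages(lineage)
for number, ops in enumerate(stages):
    print(f"stage {number}: {' -> '.join(ops)}")
```

With this lineage the two wide operations produce three stages, and the narrow operations that follow each shuffle are grouped into the same stage as the shuffle read.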

Tasks and executors

Each stage becomes a set of tasks, one per partition. The driver schedules tasks onto executors, which are JVM processes spread across the cluster, each with its own cores and memory. Executors run tasks in parallel, as many at a time as they have free cores, and report status and results back to the driver.
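The one-task-per-partition model can be sketched with a thread pool standing in for executor cores. This is a toy under stated assumptions: Python threads are not JVM executors, and make_partitions, run_stage, and task are hypothetical names for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(data, num_partitions):
    """Split data into num_partitions slices (round-robin)."""
    data = list(data)
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_stage(partitions, task_fn, num_slots):
    """Run one task per partition on a pool standing in for executor cores."""
    with ThreadPoolExecutor(max_workers=num_slots) as pool:
        # pool.map returns results in partition order, whatever the finish order.
        return list(pool.map(task_fn, partitions))


def task(partition):
    # The pipelined narrow work for one partition: keep evens, square, sum.
    return sum(x * x for x in partition if x % 2 == 0)


partitions = make_partitions(range(10), num_partitions=4)
partial_sums = run_stage(partitions, task, num_slots=2)
total = sum(partial_sums)  # the "driver" combines the per-task results
```

With four partitions and two slots, at most two tasks run at once and the remaining tasks wait for a free slot, which loosely mirrors how a stage with more partitions than cores is processed in waves.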

Why stage boundaries matter

Narrow transformations pipeline cheaply within a stage. Wide transformations such as joins and group-by operations generally force a shuffle, the expensive boundary that defines stages and often dominates runtime.
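The difference in cost between the two kinds of transformation can be made concrete by counting how many records have to change partition. The sketch below is a toy model with made-up data, not Spark's shuffle implementation. The function shuffle_by_key is hypothetical, and it uses integer keys with a modulo partitioner so the result is deterministic.

```python
def shuffle_by_key(partitions, num_out):
    """Redistribute (key, value) records so equal keys share a partition.

    Returns the new partitions and the number of records that changed
    partition, a rough proxy for data sent over the network.
    """
    out = [[] for _ in range(num_out)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = key % num_out  # toy partitioner for integer keys
            out[dst].append((key, value))
            if dst != src:
                moved += 1
    return out, moved


partitions = [
    [(1, "a"), (2, "b")],
    [(1, "c"), (3, "d")],
    [(2, "e"), (1, "f")],
]

# Narrow step: each output partition is computed from its own input partition,
# so no record leaves the partition it started in.
mapped = [[(k, v.upper()) for k, v in part] for part in partitions]

# Wide step: grouping by key needs every record with the same key in one place.
shuffled, moved = shuffle_by_key(mapped, num_out=3)
```

In this small example the map step moves nothing, while the shuffle relocates four of the six records. On a real cluster that movement also involves serialization, disk, and network I/O.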

Key idea

Spark lazily builds a DAG, cuts it into stages at shuffle boundaries, and runs one task per partition on executors, so minimizing shuffles is usually the main performance lever.
