The Spark Execution Model

How Spark turns transformations into a DAG of stages and tasks executed lazily across executors.

Lazy transformations

In Spark you describe a computation with transformations such as map and filter. Each transformation only extends a lineage graph; nothing runs until an action such as count, or a write to storage, is called. Because the engine sees the whole plan before executing anything, this lazy evaluation lets it optimize the plan as a whole.
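The idea can be sketched without Spark at all. The following is a toy model in plain Python, not Spark's API: the `LazyDataset` class and the `double` helper are hypothetical names invented for illustration. Transformations only record a step in the lineage, and the `count` action is the first point where any record is touched.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = source      # the input records
        self._lineage = lineage    # recorded transformations, in order

    def map(self, fn):
        # Transformation: record the step, compute nothing.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step, compute nothing.
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def _run(self):
        # Replay the recorded lineage over the source, record by record.
        for record in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    def count(self):
        # Action: this is the first point where any work happens.
        return sum(1 for _ in self._run())


calls = []

def double(x):
    calls.append(x)   # side effect so we can observe when work happens
    return x * 2

ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)   # no action yet, so double has not run
result = ds.count()                # the action triggers the computation
calls_after_action = len(calls)
```

Here `calls_before_action` stays at zero because building `ds` only records the steps. The real engine does far more with the recorded plan, but the separation between describing and executing is the same.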

From DAG to stages

When an action fires, the driver compiles the lineage into a DAG and splits it into stages. A new stage begins wherever data must be shuffled across the network. Within a stage, every dependency is narrow, meaning each output partition can be computed from a small, fixed set of parent partitions (usually just one), so no shuffle is needed.
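A simplified way to picture the stage split is to walk an ordered list of operations and start a new stage at every operation that needs a shuffle. This sketch assumes a linear lineage, whereas a real DAG can branch and merge. The function `split_into_stages` and the `needs_shuffle` flag are hypothetical names for illustration, not part of Spark.

```python
def split_into_stages(ops):
    """Group an ordered list of (name, needs_shuffle) ops into stages.

    A new stage starts at every op that needs a shuffle.
    """
    stages = [[]]
    for name, needs_shuffle in ops:
        # Only open a new stage if the current one already holds work.
        if needs_shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


# A linear lineage: three narrow steps, a wide step, a narrow step, a wide step.
lineage = [
    ("read", False),
    ("map", False),
    ("filter", False),
    ("groupByKey", True),
    ("map", False),
    ("sortByKey", True),
]
stages = split_into_stages(lineage)
```

With this input, the two shuffle-requiring operations produce three stages, and the narrow steps stay grouped with whatever stage they follow.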

Tasks and executors

Each stage becomes a set of tasks, one per partition. The driver schedules tasks onto executors, which are JVM processes spread across the cluster, each with its own cores and memory. Executors run tasks in parallel and report results back to the driver.
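The one-task-per-partition rule can be imitated with a thread pool standing in for executor cores. This is only an analogy under stated assumptions: `run_stage`, `count_even`, and `num_slots` are hypothetical names, and real executors are separate JVM processes rather than threads in one Python process.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(partitions, task_fn, num_slots=2):
    """Run one task per partition on a small pool of worker slots."""
    with ThreadPoolExecutor(max_workers=num_slots) as pool:
        # pool.map returns results in the same order as the partitions.
        return list(pool.map(task_fn, partitions))


def count_even(partition):
    # The work a single task does on its own partition.
    return sum(1 for x in partition if x % 2 == 0)


partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
per_task = run_stage(partitions, count_even)   # one result per partition
num_tasks = len(per_task)
total = sum(per_task)                          # combined back on the "driver"
```

Four partitions yield four tasks even though only two slots are available, so tasks queue until a slot frees up. The number of partitions sets the amount of parallelism on offer, and the number of slots sets how much of it is used at once.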

Why stage boundaries matter

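Narrow transformations such as map and filter are pipelined cheaply within a stage. Wide transformations such as group by and most joins force a shuffle, in which records are redistributed across the network so that matching keys end up in the same partition. That shuffle is the expensive boundary that defines stages, and it often dominates runtime.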
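To see why a wide transformation costs more, compare a narrow step that rewrites each partition in place with a shuffle that regroups records by key. The sketch below is a toy model, not Spark's shuffle implementation: `shuffle_by_key` is a hypothetical helper, keys are small integers, and the destination partition is assumed to be `key % num_out`.

```python
def shuffle_by_key(partitions, num_out):
    """Repartition (key, value) records so equal keys share a partition.

    Returns the new partitions and how many records changed partition index.
    """
    out = [[] for _ in range(num_out)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = key % num_out          # assumed partitioning rule
            out[dst].append((key, value))
            if dst != src:
                moved += 1               # this record would cross the network
    return out, moved


keyed_partitions = [
    [(0, "a"), (1, "b")],
    [(0, "c"), (1, "d")],
]

# Narrow step: each partition is transformed on its own, nothing moves.
mapped = [[(k, v.upper()) for k, v in part] for part in keyed_partitions]

# Wide step: records must be regrouped by key before grouping can happen.
shuffled, moved = shuffle_by_key(mapped, 2)
```

In this tiny example half of the records change partition. On a cluster each such move means serialization, disk writes, and network transfer, which is why the narrow step is cheap by comparison.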

Key idea

Spark lazily builds a DAG, cuts it into stages at shuffle boundaries, and runs one task per partition on executors, so reducing the number and size of shuffles is usually the main performance lever.

Check yourself

1. When does a Spark job actually start computing?

2. What marks the boundary between two Spark stages?

3. How many tasks does a stage create?