Lazy transformations
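In Spark you describe a computation with transformations such as map and filter. Each transformation only records a step in a lineage graph; nothing runs until an action such as count, collect, or saveAsTextFile is called. This lazy evaluation lets the engine see the whole plan before executing it, so it can optimize the plan as a whole, for example by pipelining consecutive steps.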
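The record-now, run-later behaviour can be illustrated with a small pure-Python model. This is a sketch of the idea only, not Spark's API: the class LazyDataset and the helper double are hypothetical names invented for the illustration.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset (hypothetical, not a Spark class)."""

    def __init__(self, source, lineage=()):
        self._source = source            # any re-iterable collection of records
        self._lineage = tuple(lineage)   # recorded (kind, function) steps

    # Transformations: return a new dataset with a longer lineage.
    # No record is read or processed here.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def _run(self):
        # Replay the recorded lineage over the source, one record at a time.
        for record in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):     # kind == "filter"
                    keep = False
                    break
            if keep:
                yield record

    # Actions: the only methods that trigger execution.
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


calls = []

def double(x):
    calls.append(x)                      # log each real invocation
    return x * 2

ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)         # still 0: nothing has run yet
result = ds.collect()                    # the action replays the lineage
```

Until collect is called, double is never invoked. Calling a second action, such as ds.count(), replays the lineage from the source again, which is roughly why a reused dataset is often cached in practice.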
From DAG to stages
When an action fires, the driver compiles the lineage into a DAG and splits it into stages. A new stage begins wherever data must be shuffled across the network. Within a stage, every dependency is narrow, meaning each parent partition feeds at most one child partition, so no data has to move between partitions.
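As a rough sketch, stage cutting for a purely linear lineage can be modelled as splitting a list of operation names wherever a wide operation appears. The operation names below are real RDD methods, but the WIDE set and the function split_into_stages are assumptions made for this illustration. The real scheduler works on a dependency graph rather than a flat list, and it can skip a shuffle when the data is already partitioned suitably.

```python
# Operations assumed to require a shuffle in this toy model.
WIDE = {"reduceByKey", "groupByKey", "sortByKey", "repartition"}

def split_into_stages(ops):
    """Cut a linear list of operation names into stages at shuffle boundaries."""
    stages = [[]]
    for op in ops:
        if op in WIDE:
            # A wide operation reads shuffled data, so it opens a new stage.
            stages.append([])
        stages[-1].append(op)
    # Drop the empty leading stage that appears if the first op is wide.
    return [stage for stage in stages if stage]

lineage = ["textFile", "flatMap", "map", "reduceByKey", "map", "sortByKey"]
stages = split_into_stages(lineage)
for number, stage in enumerate(stages):
    print(f"stage {number}: {' -> '.join(stage)}")
```

With two wide operations in the lineage, the model produces three stages; every run of narrow operations between shuffles is fused into a single stage.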
Tasks and executors
Each stage becomes a set of tasks, one per partition. The driver schedules tasks onto executors, which are JVM processes spread across the cluster, each holding a share of cores and memory. Executors run tasks in parallel, by default one task per core at a time, and report status and results back to the driver.
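The one-task-per-partition model can be sketched with a thread pool standing in for executor cores. This is only an analogy: real executors are separate JVM processes, and the function run_stage and its total_cores parameter are hypothetical names used for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_stage(partitions, task_fn, total_cores=2):
    """Run one task per partition, with at most total_cores tasks at a time."""
    with ThreadPoolExecutor(max_workers=total_cores) as pool:
        # pool.map submits one task per partition and returns results
        # in partition order, whatever order the tasks finish in.
        return list(pool.map(task_fn, partitions))

# Three partitions, so the stage consists of three tasks.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Each task counts the records in its own partition...
partial_counts = run_stage(partitions, len)

# ...and the "driver" combines the per-task results.
total = sum(partial_counts)
```

With three partitions and two cores, two tasks run at once and the third waits for a free slot. That is why the number of partitions relative to the available cores affects how evenly a stage fills the cluster.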
Why stage boundaries matter
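Narrow transformations such as map and filter pipeline cheaply within a stage. Wide transformations such as groupByKey and most joins force a shuffle: records are repartitioned by key, written out, and fetched over the network. That shuffle is the expensive boundary that defines stages, and it often dominates runtime.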
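The difference between the two kinds of dependency can be made concrete with a toy model. In this sketch, narrow_map and shuffle_by_key are hypothetical helpers, and the partitioner (integer key modulo the number of output partitions) is an assumption chosen for simplicity, not what Spark uses.

```python
def narrow_map(partitions, fn):
    # Narrow: each output partition is computed from exactly one input partition.
    return [[fn(record) for record in part] for part in partitions]

def shuffle_by_key(partitions, num_out):
    # Wide: an output partition may need records from many input partitions.
    out = [[] for _ in range(num_out)]
    sources = [set() for _ in range(num_out)]
    for src_index, part in enumerate(partitions):
        for key, value in part:
            dst_index = key % num_out        # toy partitioner for integer keys
            out[dst_index].append((key, value))
            sources[dst_index].add(src_index)
    return out, sources

partitions = [
    [(1, "a"), (2, "b")],
    [(1, "c"), (3, "d")],
    [(2, "e"), (4, "f")],
]

# Narrow step: partition boundaries are untouched.
upper = narrow_map(partitions, lambda kv: (kv[0], kv[1].upper()))

# Wide step: records with the same key are brought into the same partition.
shuffled, sources = shuffle_by_key(upper, num_out=2)
```

After the narrow step the data still sits in its three original partitions. After the shuffle, each of the two output partitions has pulled records from two different input partitions, which is the cross-partition traffic that makes the boundary expensive.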
Key idea
Spark lazily builds a DAG, cuts it into stages at shuffle boundaries, and runs one task per partition on executors, so minimizing shuffles is usually the main performance lever.