Spark RDD and DataFrame

The two abstractions

An RDD, or resilient distributed dataset, is a partitioned, immutable collection with a record of how it was built. A DataFrame is a higher level table with named columns and a schema, built on the same engine.

How they execute

Transformations are lazy: calling map or filter only records a plan, nothing runs.
An action like count or write triggers execution of the whole chain at once.
Spark remembers each step as lineage, so a lost partition is recomputed from its parents instead of needing replication.

Why DataFrames are faster

DataFrames pass through the Catalyst optimizer, which reorders filters, prunes columns, and chooses join strategies, much like a SQL planner. RDDs run your code as written with no such rewriting, so DataFrames usually win.

Prefer DataFrames for structured work and drop to RDDs only when you need low level control.

Key idea

Spark builds a lazy lineage of partitioned data, and DataFrames add a schema plus an optimizer that makes execution faster.

Spark RDD and DataFrame

The two abstractions

How they execute

Why DataFrames are faster

Key idea

Check yourself