The two abstractions
An RDD, or resilient distributed dataset, is a partitioned, immutable collection with a record of how it was built. A DataFrame is a higher level table with named columns and a schema, built on the same engine.
How they execute
- Transformations are lazy: calling map or filter only records a plan, nothing runs.
- An action like count or write triggers execution of the whole chain at once.
- Spark remembers each step as lineage, so a lost partition is recomputed from its parents instead of needing replication.
Why DataFrames are faster
DataFrames pass through the Catalyst optimizer, which reorders filters, prunes columns, and chooses join strategies, much like a SQL planner. RDDs run your code as written with no such rewriting, so DataFrames usually win.
Prefer DataFrames for structured work and drop to RDDs only when you need low level control.
Key idea
Spark builds a lazy lineage of partitioned data, and DataFrames add a schema plus an optimizer that makes execution faster.