System Design

Spark RDD and DataFrame

Lazy lineage of partitioned data and the optimized relational layer above it.

The two abstractions

An RDD, or resilient distributed dataset, is an immutable collection split into partitions, together with a record of how it was built (its lineage). A DataFrame is a higher-level table with named columns and a schema, built on the same engine.
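To make the two definitions concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: ToyRDD, ToyDataFrame, and parallelize are hypothetical names, and the round-robin split is an arbitrary choice for the sketch rather than a claim about how Spark assigns records to partitions.

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Toy model only: these classes are hypothetical and are not Spark's API.


@dataclass(frozen=True)
class ToyRDD:
    partitions: Tuple[Tuple[Any, ...], ...]  # immutable chunks of records
    lineage: Tuple[str, ...]                 # record of how it was built


@dataclass(frozen=True)
class ToyDataFrame:
    schema: Tuple[Tuple[str, type], ...]     # (column name, column type) pairs
    rdd: ToyRDD                              # same partitioned data underneath

    def column(self, name: str) -> list:
        """Look a column up by name, which a bare ToyRDD cannot do."""
        index = [col for col, _ in self.schema].index(name)
        return [row[index] for part in self.rdd.partitions for row in part]


def parallelize(records, num_partitions):
    """Split records round-robin into a fixed number of partitions."""
    records = list(records)
    parts = tuple(tuple(records[i::num_partitions]) for i in range(num_partitions))
    return ToyRDD(partitions=parts, lineage=(f"parallelize({len(records)} records)",))


rows = [("ada", 36), ("bo", 19), ("cy", 52)]
rdd = parallelize(rows, num_partitions=2)
df = ToyDataFrame(schema=(("name", str), ("age", int)), rdd=rdd)
ages = df.column("age")  # the schema gives the tuples named, typed columns
```

The ToyRDD only knows it holds opaque records in partitions, while the ToyDataFrame wraps the same data with names and types. That extra structure is what lets a planner reason about columns.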

How they execute

  • Transformations are lazy: calling map or filter only records a step in the plan; nothing runs yet.
  • An action such as count or a write triggers execution of the whole recorded chain.
  • Spark remembers each step as lineage, so a lost partition is recomputed from its parents instead of needing replication.
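The three points above can be sketched with a small toy model in plain Python. LazyDataset is a hypothetical class, not Spark's API, and its runs counter exists only to show when work actually happens.

```python
# Toy model only: LazyDataset is a hypothetical class, not Spark's API.


class LazyDataset:
    def __init__(self, source_partitions, steps=()):
        self._source = source_partitions   # list of lists: the parent data
        self._steps = tuple(steps)         # lineage: recorded transformations
        self.runs = 0                      # how many partitions were computed

    # Transformations: record a step and return a new dataset. Nothing runs.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def compute_partition(self, index):
        """Replay the recorded lineage over one parent partition."""
        self.runs += 1
        rows = self._source[index]
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    # Action: triggers execution of the whole recorded chain.
    def count(self):
        return sum(len(self.compute_partition(i)) for i in range(len(self._source)))


source = [[1, 2, 3], [4, 5, 6]]
evens_squared = LazyDataset(source).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
runs_before_action = evens_squared.runs   # still 0: only a plan exists
total = evens_squared.count()             # the action computes both partitions

# Simulate losing partition 1: rebuild it alone from its parent and the lineage.
recovered = evens_squared.compute_partition(1)
```

Recovery here touches only the lost partition's parent data and replays the recorded steps, so no second copy of the computed result has to be kept around.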

Why DataFrames are faster

DataFrames pass through the Catalyst optimizer, which moves filters earlier in the plan, prunes unused columns, and chooses join strategies, much like a SQL planner. RDD code runs as written, with no such rewriting, so the DataFrame version of the same job is usually faster.
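As a rough illustration of two of those rewrites, the sketch below optimizes a tiny hand-written plan in plain Python. It is a toy model and not Catalyst: the plan format, TABLE, run, and optimize are all hypothetical, the plan is assumed to start with a single scan, and select is assumed to only drop columns (no renames or computed columns), which is what makes moving the filter safe here.

```python
# Toy model only: a hypothetical plan format, not Catalyst itself.

TABLE = [
    {"id": 1, "country": "SE", "amount": 40, "note": "a"},
    {"id": 2, "country": "DE", "amount": 15, "note": "b"},
    {"id": 3, "country": "SE", "amount": 75, "note": "c"},
]


def run(plan, table):
    """Execute plan steps in order; return (rows, cells read by the scan)."""
    rows, cells_read = [], 0
    for step in plan:
        op = step[0]
        if op == "scan":
            cols = step[1]
            rows = [{c: r[c] for c in cols} for r in table]
            cells_read = len(rows) * len(cols)
        elif op == "filter":
            _, col, value = step
            rows = [r for r in rows if r[col] == value]
        elif op == "select":
            rows = [{c: r[c] for c in step[1]} for r in rows]
    return rows, cells_read


def optimize(plan):
    """Apply two rewrites: prune unused columns, then run filters first."""
    scan = plan[0]
    filters = [s for s in plan[1:] if s[0] == "filter"]
    selects = [s for s in plan[1:] if s[0] == "select"]
    used = {s[1] for s in filters} | {c for s in selects for c in s[1]}
    # Without a select every column reaches the output, so nothing is pruned.
    pruned = [c for c in scan[1] if c in used] if selects else list(scan[1])
    return [("scan", pruned)] + filters + selects


written = [
    ("scan", ["id", "country", "amount", "note"]),
    ("select", ["id", "country", "amount"]),
    ("filter", "country", "SE"),
]
optimized = optimize(written)

rows_written, cells_written = run(written, TABLE)
rows_optimized, cells_optimized = run(optimized, TABLE)
```

Both plans return the same two rows, but the optimized one never reads the note column and filters before projecting. An RDD pipeline corresponds to running the written plan exactly as given.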

Prefer DataFrames for structured work and drop to RDDs only when you need low-level control.

Key idea

Spark builds a lazy lineage of partitioned data, and DataFrames add a schema plus an optimizer that makes execution faster.

Check yourself

1. What triggers actual computation in Spark?

2. Why are DataFrames usually faster than RDDs?

3. How does Spark recover a lost partition?