Columnar Formats: Parquet and ORC

Why analytics stores data by column, and how Parquet and ORC make scans fast and small.

Rows versus columns on disk

A row format stores each record's fields together, which suits reading or writing one whole record at a time. A columnar format stores all values of one column together. Analytics queries usually touch a few columns across many rows, so a columnar reader can skip the columns it does not need entirely.
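The difference between the two layouts can be sketched in a few lines of plain Python. This is only an in-memory illustration, not how Parquet or ORC lay out bytes; the table, its column names, and the helper functions are hypothetical.

```python
# Hypothetical three-row table used only for illustration.
rows = [
    {"user_id": 1, "country": "DE", "amount": 9.5},
    {"user_id": 2, "country": "DE", "amount": 3.0},
    {"user_id": 3, "country": "FR", "amount": 7.25},
]

# Row layout: each record's fields sit next to each other.
row_layout = [(r["user_id"], r["country"], r["amount"]) for r in rows]

# Columnar layout: all values of one column sit next to each other.
column_layout = {name: [r[name] for r in rows] for name in rows[0]}


def total_amount_from_rows(layout):
    # Every full record is visited even though only one field is needed.
    return sum(record[2] for record in layout)


def total_amount_from_columns(layout):
    # Only the "amount" column is touched; the other columns are never read.
    return sum(layout["amount"])
```

Both functions return the same total, but the columnar version never looks at `user_id` or `country`. On disk, that is the part of the file a reader can skip.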

Why Parquet and ORC win

Parquet and ORC are open columnar file formats built for big data engines.

  • Compression improves because values in one column share a type and are often similar or repeated, so they compress tightly.
  • Column pruning lets the reader load only the columns a query references.
  • Predicate pushdown uses per-chunk statistics such as min and max to skip blocks (row groups in Parquet, stripes in ORC) that cannot match a filter.
  • Encodings such as dictionary and run-length encoding shrink repeated values before general-purpose compression is applied.
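The encoding and pushdown ideas in the list above can be sketched as a toy model. It assumes simplified versions of each technique: real Parquet and ORC files use more elaborate binary encodings and metadata, and every function name here is hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def build_chunks(values, chunk_size):
    """Split a column into chunks and record min/max statistics for each chunk."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        data = values[start:start + chunk_size]
        chunks.append({"min": min(data), "max": max(data), "data": data})
    return chunks


def scan_greater_than(chunks, threshold):
    """Return values > threshold and how many chunks were skipped via statistics."""
    matches, skipped = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            skipped += 1  # no value in this chunk can satisfy value > threshold
            continue
        matches.extend(v for v in chunk["data"] if v > threshold)
    return matches, skipped


# Encoding: a repetitive string column shrinks to small codes, then to runs.
country = ["DE", "DE", "DE", "FR", "FR", "US"]
dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)

# Pushdown: the filter amount > 10 is checked against chunk statistics first.
amount = [1, 2, 3, 4, 50, 60, 5, 6]
chunks = build_chunks(amount, chunk_size=4)
matches, skipped = scan_greater_than(chunks, threshold=10)
```

Here the first chunk has a maximum of 4, so the scan rejects it without reading its values. Skipping works best when data is sorted or clustered on the filtered column, because each chunk's min and max then cover a narrow range.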

The trade-off

Columnar formats are excellent for scans and aggregations but poor for single-row updates and small, frequent writes, because changing one record means rewriting encoded column data. That is why they are usually paired with append-style batch writes and lakehouse table layers.
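A rough sketch of why updates are costly, assuming a column chunk stored with a simple run-length encoding. The helpers are hypothetical and far simpler than a real file writer, but the shape of the work is similar: an in-place change forces a decode and re-encode of the whole chunk, while an append only adds a new chunk.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the full list of values."""
    return [v for v, n in runs for _ in range(n)]


def update_row(encoded_chunk, row_index, new_value):
    """Change one value: the entire chunk is decoded and re-encoded."""
    values = rle_decode(encoded_chunk)  # touches every value in the chunk
    values[row_index] = new_value
    return rle_encode(values)           # rewrites the whole chunk


def append_batch(chunks, new_values):
    """Append a batch: a new chunk is written, existing chunks stay untouched."""
    return chunks + [rle_encode(new_values)]


chunk = rle_encode(["ok"] * 5)
updated = update_row(chunk, row_index=2, new_value="error")
chunks = append_batch([chunk], ["ok", "ok"])
```

One changed value turns a single run into three, and the cost of the rewrite grows with the size of the chunk rather than the size of the change.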

Key idea

Parquet and ORC store data by column, so analytics data compresses better and queries read only the columns and blocks they need, at the cost of slow single-row updates.

Check yourself

1. Why do columnar formats compress better than row formats?

2. What is predicate pushdown in Parquet or ORC?

3. What is a weakness of columnar formats?