Columnar Formats: Parquet and ORC

Why analytics stores data by column, and how Parquet and ORC make scans fast and small.

Rows versus columns on disk

A row format stores each record together, which suits writing one row at a time. A columnar format stores all values of one column together. Analytics queries usually touch a few columns across many rows, so reading by column means you skip the columns you do not need entirely.
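The difference between the two layouts can be sketched with plain Python lists. This is a toy in-memory model, not how either format actually lays out bytes; the table contents and the helper names to_row_layout and to_column_layout are made up for illustration.

```python
# Toy table: three records with three fields each (illustrative data).
records = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.5},
    {"id": 3, "country": "DE", "amount": 7.25},
]

def to_row_layout(rows):
    """Row format: the values of one record sit next to each other."""
    return [tuple(r.values()) for r in rows]

def to_column_layout(rows):
    """Columnar format: all values of one column sit next to each other."""
    names = list(rows[0].keys())
    return {name: [r[name] for r in rows] for name in names}

row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# SUM(amount) on the row layout walks every record,
# stepping over the id and country fields it does not need.
total_from_rows = sum(rec[2] for rec in row_layout)

# On the column layout the same query reads a single list
# and never touches id or country.
total_from_columns = sum(column_layout["amount"])
```

Both totals are the same; what changes is how much data the query has to walk past to compute them.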

Why Parquet and ORC win

Parquet and ORC are open columnar file formats built for big data engines.

Compression is far better because the values in one column share a type and tend to be similar, so they compress tightly.
Column pruning lets the reader load only the columns a query references.
Predicate pushdown uses per-chunk statistics such as min and max to skip blocks that cannot match a filter.
Encodings such as dictionary and run-length encoding shrink repeated values.
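Two of these ideas, encoding and statistics-based skipping, are small enough to sketch in pure Python. The functions below are simplified illustrations with hypothetical names; the real formats use more elaborate schemes and binary layouts, so treat this as a model of the principle rather than of Parquet or ORC internals.

```python
from itertools import groupby

def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary, index_of, codes = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def build_chunks(values, chunk_size):
    """Split a column into chunks and record min/max statistics for each one."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        data = values[start:start + chunk_size]
        chunks.append({"min": min(data), "max": max(data), "data": data})
    return chunks

def scan_greater_than(chunks, threshold):
    """Return values above the threshold and how many chunks were skipped via stats."""
    matches, skipped = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            skipped += 1  # no value in this chunk can exceed the threshold
            continue
        matches.extend(v for v in chunk["data"] if v > threshold)
    return matches, skipped

# Encoding: a repetitive string column becomes a tiny dictionary plus short runs.
country = ["DE", "DE", "DE", "US", "US", "DE"]
dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)

# Predicate pushdown: "amount > 30" only needs to open one of three chunks.
amount = [3, 5, 4, 40, 42, 41, 7, 9, 8]
chunks = build_chunks(amount, chunk_size=3)
matches, skipped = scan_greater_than(chunks, threshold=30)
```

Here the six country strings reduce to a two-entry dictionary and three runs, and the filter skips two of the three chunks without reading their values. Skipping works best when similar values are stored near each other, as in this example.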

The trade-off

Columnar formats are excellent for reads and aggregations but poor for single-row updates and small, frequent writes, which is why they are usually paired with append-style batch writes and lakehouse table layers.
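A rough way to see the cost is to model a columnar file as write-once, so that changing a single value means producing a whole new file. That is a simplifying assumption for illustration, and write_file, update_row, and append_batch are hypothetical helpers, not part of any library.

```python
def write_file(rows):
    """Model a write-once columnar file as a dict of column name -> tuple of values."""
    names = list(rows[0].keys())
    return {name: tuple(r[name] for r in rows) for name in names}

def update_row(file, row_index, column, new_value):
    """Change one value by rewriting the file; also return how many values were rewritten."""
    new_file, rewritten = {}, 0
    for name, values in file.items():
        values = list(values)
        if name == column:
            values[row_index] = new_value
        new_file[name] = tuple(values)
        rewritten += len(values)
    return new_file, rewritten

def append_batch(files, rows):
    """Appending adds one new file and leaves the existing files untouched."""
    return files + [write_file(rows)]

# Illustrative data: 1,000 rows with three columns.
base_rows = [{"id": i, "country": "DE", "amount": float(i)} for i in range(1000)]
base = write_file(base_rows)

# Updating a single value rewrites every value in the file under this model.
new_file, rewritten = update_row(base, row_index=10, column="amount", new_value=99.0)

# Appending a batch of 100 rows writes only the new rows.
batch = [{"id": 1000 + i, "country": "US", "amount": 1.0} for i in range(100)]
files = append_batch([base], batch)
```

Under this model one changed value costs 3,000 rewritten values, while the append writes only the new batch and never touches the existing file.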

Key idea

Parquet and ORC store data by column, so the data compresses better and analytics queries read only the columns and blocks they need, at the cost of slow single-row updates.
