Rows versus columns on disk
A row format stores all fields of a record together, which suits writing one row at a time. A columnar format stores all values of one column together. Analytics queries usually touch a few columns across many rows, so a columnar reader can skip the unneeded columns instead of scanning whole records.
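The difference between the two layouts can be sketched in memory with plain Python. This is only an illustration: the records and the field names (user_id, country, amount) are made up, and real file formats add headers, chunking and encoding on top of this idea.

```python
# Toy records; the field names are hypothetical.
records = [
    {"user_id": 1, "country": "DE", "amount": 12.5},
    {"user_id": 2, "country": "DE", "amount": 7.0},
    {"user_id": 3, "country": "FR", "amount": 30.0},
]


def to_row_layout(records):
    """Row layout: the values of one record sit next to each other."""
    return [tuple(r.values()) for r in records]


def to_column_layout(records):
    """Columnar layout: the values of one column sit next to each other."""
    return {name: [r[name] for r in records] for name in records[0]}


rows = to_row_layout(records)
columns = to_column_layout(records)

# Row layout: summing one field still walks every full record.
total_from_rows = sum(row[2] for row in rows)

# Columnar layout: only the "amount" list is touched.
total_from_columns = sum(columns["amount"])
```

Both totals are the same, but the columnar version never looks at user_id or country, which is the access pattern an aggregation over one column wants.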
Why Parquet and ORC win
Parquet and ORC are open columnar file formats built for big data engines.
- Compression is typically much better than in row formats because values in one column share a type and tend to be similar, so they compress tightly.
- Column pruning lets the reader load only the columns a query references.
- Predicate pushdown uses per-chunk statistics such as min and max (kept per row group in Parquet and per stripe in ORC) to skip blocks that cannot match a filter.
- Encodings such as dictionary and run-length encoding shrink repeated values.
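Two of the points above, encoding and statistics-based skipping, can be sketched in a few lines of Python. This is a simplified illustration, not the actual Parquet or ORC implementation: the function names, the sample column and the (min, max) pairs are all assumptions made for the example.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    index_of = {}
    codes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs


def chunks_to_read(chunk_stats, lower_bound):
    """Return indexes of chunks that might hold a value > lower_bound.

    A chunk whose max is <= lower_bound cannot match, so it is skipped.
    """
    return [i for i, (lo, hi) in enumerate(chunk_stats) if hi > lower_bound]


country = ["DE", "DE", "DE", "FR", "FR", "DE"]
dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)

# Hypothetical (min, max) statistics of an "amount" column in three chunks.
amount_stats = [(1, 40), (35, 90), (5, 20)]
selected = chunks_to_read(amount_stats, lower_bound=50)
```

Six strings shrink to a two-entry dictionary plus three runs, and a filter like amount > 50 only needs to open the second chunk. Skipping is conservative: a selected chunk may still contain no matching rows, so the reader filters its values as usual.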
The trade-off
Columnar formats are excellent for reads and aggregations but poor for single-row updates and small, frequent writes, which is why they are usually paired with append-style batch writes and lakehouse table layers.
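One way to picture the append-style pattern is a writer that buffers incoming rows and only emits a columnar chunk once a batch is full. The class below is a hypothetical sketch in plain Python, not the API of any Parquet or ORC library; the class name, the batch_size parameter and the sample rows are assumptions.

```python
class BatchingColumnWriter:
    """Buffers rows and flushes them as immutable columnar chunks (hypothetical)."""

    def __init__(self, column_names, batch_size):
        self.column_names = column_names
        self.batch_size = batch_size
        self.buffer = []  # rows waiting to be written
        self.chunks = []  # flushed chunks, never modified afterwards

    def append(self, row):
        """Accept one row; write a chunk only when the batch is full."""
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Pivot the buffered rows into columns and append them as a new chunk."""
        if not self.buffer:
            return
        chunk = {
            name: [row[i] for row in self.buffer]
            for i, name in enumerate(self.column_names)
        }
        self.chunks.append(chunk)
        self.buffer = []


writer = BatchingColumnWriter(["user_id", "amount"], batch_size=2)
for row in [(1, 12.5), (2, 7.0), (3, 30.0)]:
    writer.append(row)
writer.flush()  # write out the final partial batch
```

Existing chunks are never rewritten here, which mirrors why changing a single row is expensive: the row's values are spread across several encoded column chunks, so an update generally means writing new data rather than editing in place.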
Key idea
Parquet and ORC store data by column, so the data compresses better and analytics queries read only the columns and blocks they need, at the cost of slow single-row updates.