Why format matters
The on-disk file format strongly affects query speed, storage cost, and how easily data evolves. The main split is between columnar and row-based layouts.
Columnar formats
- Parquet and ORC store values column by column.
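- Analytical queries often read only a few columns across many rows, so a columnar layout lets the engine skip the columns it does not need.
- They compress well because similar values sit together, and they keep min and max stats per chunk (row groups in Parquet, stripes in ORC) so the engine can skip chunks that cannot match a filter.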
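The two ideas above, reading only the needed columns and skipping chunks by their min and max stats, can be sketched in plain Python. This is a toy model and not the actual Parquet or ORC layout; the sample rows, `CHUNK_SIZE`, `to_column_chunks`, and `scan` are all hypothetical names made up for illustration.

```python
# Toy model of a columnar layout: not Parquet or ORC, just the idea.
rows = [
    {"user_id": 1, "country": "DE", "amount": 10},
    {"user_id": 2, "country": "DE", "amount": 25},
    {"user_id": 3, "country": "FR", "amount": 40},
    {"user_id": 4, "country": "FR", "amount": 55},
]

CHUNK_SIZE = 2  # hypothetical chunk size, kept tiny for illustration


def to_column_chunks(rows, chunk_size):
    """Split rows into chunks; store each chunk column by column with min/max stats."""
    chunks = []
    for start in range(0, len(rows), chunk_size):
        part = rows[start:start + chunk_size]
        columns = {name: [r[name] for r in part] for name in part[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        chunks.append({"columns": columns, "stats": stats})
    return chunks


def scan(chunks, select, filter_col, lower_bound):
    """Return `select` values where filter_col >= lower_bound, skipping chunks via stats."""
    out = []
    chunks_read = 0
    for chunk in chunks:
        _, col_max = chunk["stats"][filter_col]
        if col_max < lower_bound:
            continue  # no row in this chunk can match, so skip it entirely
        chunks_read += 1
        # Only the two columns involved are touched; user_id is never read.
        wanted = chunk["columns"][select]
        flt = chunk["columns"][filter_col]
        out.extend(v for v, f in zip(wanted, flt) if f >= lower_bound)
    return out, chunks_read


chunks = to_column_chunks(rows, CHUNK_SIZE)
result, chunks_read = scan(chunks, select="country", filter_col="amount", lower_bound=30)
print(result, chunks_read)
```

Here the first chunk has a maximum amount of 25, so the filter `amount >= 30` rules it out from the stats alone and only the second chunk is read.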
Row-based format
- Avro stores whole records together, which suits writing and streaming where you process one record at a time.
- An Avro file embeds its schema alongside compactly encoded binary records, making it strong for event pipelines and message passing.
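The row-based idea can be sketched as a stream with a schema header followed by whole records that are read back one at a time. This is a toy stand-in and not Avro's real binary encoding; the `Click` schema, `write_stream`, and `read_stream` are hypothetical, and JSON with a length prefix is used only to keep the sketch in the standard library.

```python
import io
import json
import struct

# Hypothetical schema: field names are stored once, in the header.
schema = {"name": "Click", "fields": ["user_id", "url"]}


def write_stream(buf, schema, records):
    """Write a schema header, then each record as one length-prefixed unit."""
    header = json.dumps(schema).encode("utf-8")
    buf.write(struct.pack(">I", len(header)))
    buf.write(header)
    for rec in records:
        # Values only, in schema field order; names are not repeated per record.
        body = json.dumps([rec[f] for f in schema["fields"]]).encode("utf-8")
        buf.write(struct.pack(">I", len(body)))
        buf.write(body)


def read_stream(buf):
    """Yield records one at a time, rebuilding field names from the header."""
    (size,) = struct.unpack(">I", buf.read(4))
    stream_schema = json.loads(buf.read(size))
    while True:
        prefix = buf.read(4)
        if len(prefix) < 4:
            return  # end of stream
        (size,) = struct.unpack(">I", prefix)
        values = json.loads(buf.read(size))
        yield dict(zip(stream_schema["fields"], values))


buf = io.BytesIO()
write_stream(buf, schema, [{"user_id": 1, "url": "/a"}, {"user_id": 2, "url": "/b"}])
buf.seek(0)
decoded = list(read_stream(buf))
print(decoded)
```

Each record is complete as soon as its bytes arrive, so a consumer can process it without waiting for the rest of the file. That is the property that suits streaming and ingestion.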
Choosing
- Use Parquet or ORC for analytical tables in a lake or warehouse.
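- Use Avro for streaming, ingestion, and row-at-a-time workloads.
All three support schema evolution, so you can add fields without breaking old readers. A new field generally needs a default value (or must be nullable) so that newer readers can still read older data.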
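A minimal sketch of how adding a field can stay compatible in both directions, assuming the reader projects each record onto its own field list and fills in defaults. The `resolve` helper and the field names are hypothetical; real Avro, Parquet, and ORC readers each have their own resolution rules.

```python
def resolve(record, reader_fields):
    """Project a record onto the reader's fields, using defaults for missing ones."""
    return {name: record.get(name, default) for name, default in reader_fields.items()}


# Reader schemas as {field name: default value}; both are made up for this sketch.
old_reader = {"user_id": None, "url": None}
new_reader = {"user_id": None, "url": None, "referrer": "unknown"}

old_record = {"user_id": 1, "url": "/a"}                        # written before the change
new_record = {"user_id": 2, "url": "/b", "referrer": "search"}  # written after the change

seen_by_old = resolve(new_record, old_reader)  # old reader ignores the added field
seen_by_new = resolve(old_record, new_reader)  # new reader fills in the default

print(seen_by_old)
print(seen_by_new)
```

The old reader never sees `referrer`, and the new reader gets `"unknown"` for records written before the field existed, so neither side breaks.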
Key idea
Columnar Parquet and ORC win for analytics, while row-based Avro fits streaming and record-at-a-time pipelines.