System Design

File Formats: Parquet, ORC, Avro

Choosing between columnar and row-based storage formats for analytics and streaming.

Why format matters

The on-disk file format strongly affects query speed, storage cost, and how easily data evolves. The main split is between columnar and row-based layouts.

Columnar formats

  • Parquet and ORC store values column by column.
  • Analytical queries often read a few columns from many rows, so a columnar layout lets the engine skip unused columns.
  • They compress well because similar values sit together, and they keep min and max statistics per chunk (row groups in Parquet, stripes in ORC), so the engine can skip chunks that cannot match a filter.
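
The two ideas above, reading only the needed columns and skipping chunks by min and max statistics, can be sketched in plain Python. This is a toy model and not how Parquet or ORC are implemented. The chunk size, the column names, and the helpers `to_column_chunks` and `sum_where_at_least` are all assumptions made up for illustration.

```python
from typing import List, Tuple

CHUNK_SIZE = 3  # assumed toy value; real formats use far larger chunks


def to_column_chunks(rows: List[dict], chunk_size: int = CHUNK_SIZE) -> List[dict]:
    """Split rows into chunks and store each chunk column by column with min/max stats."""
    chunks = []
    for start in range(0, len(rows), chunk_size):
        part = rows[start:start + chunk_size]
        columns = {name: [r[name] for r in part] for name in part[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        chunks.append({"columns": columns, "stats": stats})
    return chunks


def sum_where_at_least(chunks: List[dict], value_col: str,
                       filter_col: str, threshold: int) -> Tuple[int, int]:
    """Sum value_col over rows where filter_col >= threshold.

    Returns (total, chunks_read). Only the two named columns are touched,
    and chunks whose max is below the threshold are skipped entirely.
    """
    total = 0
    chunks_read = 0
    for chunk in chunks:
        _, col_max = chunk["stats"][filter_col]
        if col_max < threshold:
            continue  # the stats prove no row in this chunk can match
        chunks_read += 1
        filt = chunk["columns"][filter_col]
        vals = chunk["columns"][value_col]
        total += sum(v for f, v in zip(filt, vals) if f >= threshold)
    return total, chunks_read


# Hypothetical data: days 1..9, written in day order.
rows = [
    {"day": d, "country": "DE" if d % 2 else "US", "amount": d * 10}
    for d in range(1, 10)
]
chunks = to_column_chunks(rows)
total, chunks_read = sum_where_at_least(chunks, "amount", "day", 7)
print(total, chunks_read, len(chunks))  # 240 1 3
```

Only one of the three chunks is read, and the country column is never touched. Chunk skipping works well here because the data is sorted by the filter column; on unsorted data the min and max ranges overlap and fewer chunks can be skipped.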

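Row-based format

  • Avro stores whole records together, which suits write-heavy and streaming workloads that process one record at a time.
  • Its binary encoding is compact, and the schema travels with the data (an Avro container file stores it in the file header), which makes it a strong fit for event pipelines and message passing.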
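
A rough sketch of the row-based idea: the schema is written once, and each record is then stored whole, as values in schema order with no repeated field names. This is not Avro's actual binary encoding; it uses JSON lines to stay readable, and the `SCHEMA` layout and the `write_records` and `read_records` helpers are hypothetical.

```python
import io
import json
from typing import BinaryIO, Iterable, Iterator

# Hypothetical schema: just a record name and an ordered list of field names.
SCHEMA = {"name": "Click", "fields": ["user_id", "page", "ts"]}


def write_records(stream: BinaryIO, schema: dict, records: Iterable[dict]) -> None:
    """Write the schema once as a header, then one whole record per line."""
    stream.write((json.dumps(schema) + "\n").encode("utf-8"))
    for record in records:
        # Values only, in schema order; field names live in the header.
        values = [record[name] for name in schema["fields"]]
        stream.write((json.dumps(values) + "\n").encode("utf-8"))


def read_records(stream: BinaryIO) -> Iterator[dict]:
    """Read the header, then yield records one at a time."""
    schema = json.loads(stream.readline())
    for line in stream:
        yield dict(zip(schema["fields"], json.loads(line)))


events = [
    {"user_id": 7, "page": "/home", "ts": 1000},
    {"user_id": 8, "page": "/cart", "ts": 1001},
]
buf = io.BytesIO()
write_records(buf, SCHEMA, events)
buf.seek(0)
decoded = list(read_records(buf))
```

Appending a record touches only the end of the stream, and a consumer can act on each record as soon as it arrives. Reading a single field across all records, however, still means decoding every full record, which is the cost that columnar formats avoid.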

Choosing

  • Use Parquet or ORC for analytical tables in a lake or warehouse.
  • Use Avro for streaming, ingestion, and row-at-a-time workloads.

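All three support schema evolution, so you can add fields without breaking old readers. In Avro, a new field needs a default value so that readers using the new schema can still read older data.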

Key idea

Columnar Parquet and ORC win for analytics, while row-based Avro fits streaming and record-at-a-time pipelines.

Check yourself

1. Why are Parquet and ORC good for analytics?

2. Which format best fits row-at-a-time streaming?