Two storage philosophies
- A data warehouse stores cleaned, structured data in a fixed schema, optimized for fast analytical SQL. You define the schema before loading, so this is schema-on-write.
- A data lake stores raw data of any shape in cheap object storage. You impose structure only when you read, so this is schema-on-read.
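The difference between the two philosophies can be sketched in a few lines of Python. This is a minimal illustration, not a real warehouse or lake: the SCHEMA dictionary, the order_id and amount columns, and the functions warehouse_load, lake_store, and lake_read are all hypothetical names invented for the example, and plain lists stand in for the storage systems.

```python
import json

# Hypothetical schema for illustration: column name -> required Python type.
SCHEMA = {"order_id": int, "amount": float}

def warehouse_load(table, record):
    """Schema on write: validate before storing and reject anything that does not fit."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"columns {sorted(record)} do not match the schema")
    for column, expected in SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)

def lake_store(lake, raw_line):
    """Schema on read: keep the raw text untouched, with no validation."""
    lake.append(raw_line)

def lake_read(lake):
    """Impose structure at read time and skip lines that cannot be interpreted."""
    rows = []
    for raw_line in lake:
        try:
            data = json.loads(raw_line)
            rows.append({"order_id": int(data["order_id"]),
                         "amount": float(data["amount"])})
        except (ValueError, KeyError, TypeError):
            continue
    return rows

warehouse, lake = [], []

# The warehouse accepts a conforming record and rejects a badly typed one.
warehouse_load(warehouse, {"order_id": 1, "amount": 9.5})
try:
    warehouse_load(warehouse, {"order_id": "2", "amount": 3.0})
    rejected = False
except ValueError:
    rejected = True

# The lake accepts everything; problems surface only when the data is read.
lake_store(lake, '{"order_id": "2", "amount": "3.0", "note": "extra field"}')
lake_store(lake, "not json at all")
parsed = lake_read(lake)
```

In this sketch the warehouse ends up holding only the one valid record, while the lake holds both raw lines and the reader decides later how to interpret them.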
Tradeoffs
- Warehouse: fast queries and strong governance, but ingestion is rigid and storing everything is expensive.
- Lake: cheap, flexible, and keeps raw fidelity for unknown future uses, but querying is slower and, without governance, data quality can degrade until the lake becomes a data swamp.
The lakehouse
A lakehouse combines both. It keeps data in open columnar files on object storage but adds a table layer that brings transactions, schema enforcement, and warehouse-like query speed to the lake.
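One way to picture the table layer is as a commit log kept next to the data files: a file counts as part of the table only once the log records it. The sketch below is a toy under stated assumptions, not how any particular lakehouse product works. The ToyTable class, the _commit_log.jsonl file name, and the user and clicks columns are hypothetical, a temporary directory stands in for object storage, JSON files stand in for columnar files, and there is no handling of concurrent writers.

```python
import json
import os
import tempfile

class ToyTable:
    """Toy table layer: data files plus an append-only commit log (hypothetical design)."""

    def __init__(self, root, schema):
        self.root = root
        self.schema = schema  # column name -> required Python type
        self.log_path = os.path.join(root, "_commit_log.jsonl")

    def _committed_files(self):
        # Only files named in the log belong to the table.
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path) as log:
            return [json.loads(line)["file"] for line in log if line.strip()]

    def append(self, rows):
        # Schema enforcement: reject the whole batch if any row does not fit.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"unexpected columns: {sorted(row)}")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"{column} must be {expected.__name__}")
        # Write the data file first; it stays invisible until the log records it.
        name = f"part-{len(self._committed_files()):05d}.json"
        with open(os.path.join(self.root, name), "w") as data_file:
            json.dump(rows, data_file)
        # The commit: one appended log line makes the whole batch visible.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"file": name}) + "\n")

    def read(self):
        rows = []
        for name in self._committed_files():
            with open(os.path.join(self.root, name)) as data_file:
                rows.extend(json.load(data_file))
        return rows

root = tempfile.mkdtemp()  # stand-in for an object storage bucket
table = ToyTable(root, {"user": str, "clicks": int})
table.append([{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}])

# A stray file that was never committed is ignored by readers.
with open(os.path.join(root, "part-99999.json"), "w") as stray:
    json.dump([{"user": "ghost", "clicks": 0}], stray)

# A batch that violates the schema is rejected before anything is written.
try:
    table.append([{"user": "c", "clicks": "many"}])
    bad_batch_rejected = False
except ValueError:
    bad_batch_rejected = True

total_clicks = sum(row["clicks"] for row in table.read())
```

The design choice to notice is that the files themselves stay plain and open, while the log supplies the all-or-nothing visibility and the schema check that a bare lake lacks.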
When to use which
- Use a warehouse for trusted business reporting on well-defined metrics.
- Use a lake for raw logs, machine learning features, and exploratory data whose schema is not yet fixed.
Key idea
A warehouse enforces schema-on-write for fast, governed analytics, while a lake stores raw data for schema-on-read flexibility. The lakehouse pattern blends the two on cheap object storage.