Three storage philosophies
Analytical data has to live somewhere, and three patterns dominate. They differ in how structured the data must be before it lands.
- A data warehouse stores cleaned, structured tables optimized for SQL queries. The schema is defined up front and enforced as data is loaded (schema on write), which makes queries fast but ingestion rigid.
- A data lake stores raw files of any shape in cheap object storage. A schema is applied only when the data is read (schema on read), which is flexible but lets the lake degrade into a swamp of undocumented, low-quality files.
- A lakehouse keeps data in open file formats on a lake but adds a table layer with transactions, schema, and indexes on top. It aims for warehouse reliability at lake cost.
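The difference between schema on write and schema on read can be sketched in a few lines of Python. This is a minimal illustration, not how any real warehouse or lake engine works: the `SCHEMA` dictionary, the `Warehouse` and `Lake` classes, and the `user_id` / `amount` columns are all hypothetical names invented for the example.

```python
import json

# Hypothetical schema for illustration: column name -> required Python type.
SCHEMA = {"user_id": int, "amount": float}


def validate(record, schema=SCHEMA):
    """Return record unchanged, or raise ValueError if it does not fit the schema."""
    if not isinstance(record, dict):
        raise ValueError("record is not an object")
    if set(record) != set(schema):
        raise ValueError(f"columns {sorted(record)} do not match {sorted(schema)}")
    for column, expected in schema.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column!r} should be {expected.__name__}")
    return record


class Warehouse:
    """Schema on write: a bad record is rejected before it is stored."""

    def __init__(self):
        self.rows = []

    def insert(self, record):
        self.rows.append(validate(record))  # validate() raises first if invalid


class Lake:
    """Schema on read: anything is stored, and the reader imposes structure."""

    def __init__(self):
        self.blobs = []

    def put(self, raw_text):
        self.blobs.append(raw_text)  # no checks at ingestion

    def read(self, schema=SCHEMA):
        rows, rejected = [], []
        for blob in self.blobs:
            try:
                # json.JSONDecodeError is a subclass of ValueError,
                # so unparseable blobs land in rejected as well.
                rows.append(validate(json.loads(blob), schema))
            except ValueError:
                rejected.append(blob)
        return rows, rejected
```

With this sketch, `Warehouse.insert` fails loudly on a malformed record, while `Lake.put` accepts it silently and the problem only surfaces in the `rejected` list at read time. That deferred failure is the flexibility of a lake and also the route by which it becomes a swamp.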
How they relate
The lakehouse is a response to teams running both a lake and a warehouse and paying to copy data between them. A metadata layer over the lake files lets you query raw and curated data in one place, without maintaining a second copy.
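One way to picture a metadata layer over lake files is a log that names which files currently belong to the table. The sketch below is a toy under stated assumptions, not the design of any real table format: the `ToyTable` class, the `_log.json` file name, and the JSON data files are hypothetical, and real table layers also handle concurrent writers, schema, and indexes, which this omits.

```python
import json
import os
import tempfile


class ToyTable:
    """A toy table layer: data files plus a JSON log naming the committed ones."""

    def __init__(self, root):
        self.root = root
        self.log_path = os.path.join(root, "_log.json")  # hypothetical log name

    def _committed(self):
        # No log yet means no committed files.
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path) as f:
            return json.load(f)

    def write_file(self, name, rows):
        """Write a data file. Readers do not see it until commit() names it."""
        with open(os.path.join(self.root, name), "w") as f:
            json.dump(rows, f)

    def commit(self, names):
        """Publish files by swapping in a new log with a single rename."""
        new_log = self._committed() + list(names)
        tmp_path = self.log_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(new_log, f)
        os.replace(tmp_path, self.log_path)

    def scan(self):
        """Read only the files named in the log and ignore everything else."""
        rows = []
        for name in self._committed():
            with open(os.path.join(self.root, name)) as f:
                rows.extend(json.load(f))
        return rows


# Usage: a written file stays invisible until it is committed.
with tempfile.TemporaryDirectory() as root:
    table = ToyTable(root)
    table.write_file("part-0.json", [{"user_id": 1}])
    before = table.scan()  # [] because part-0.json is not in the log yet
    table.commit(["part-0.json"])
    after = table.scan()   # [{"user_id": 1}]
```

The data stays in ordinary files on cheap storage. Only the small log decides what counts as part of the table, which is what lets a reader see either all of a write or none of it.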
Key idea
Warehouses enforce schema on write for speed, lakes defer schema to read time for flexibility, and the lakehouse adds a table layer over lake files in an attempt to get both at once.