Encoding Per Column
Columnar warehouses pick an encoding for each column based on the shape of its data, such as its cardinality, sort order, and value range. Encoding turns values into a compact form that is fast to scan, often before a general-purpose compressor runs on top.
Common Encodings
- Run length: store a value plus how many times it repeats consecutively. Works best on sorted columns or columns whose values rarely change from row to row.
- Dictionary: map distinct values to small integer codes. Ideal for low-cardinality text such as country names.
- Delta: store the difference between each value and its predecessor. Ideal for sorted timestamps or IDs, where the differences are small.
- Bit packing: store each value in only as many bits as the column's value range needs.
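The four encodings above can be sketched in a few lines of Python. This is an illustrative sketch only, not how any particular warehouse implements them; the function names and sample values are made up for this example, and the bit-packing helper only computes the bit width rather than packing actual bytes.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    dictionary = {}
    codes = []
    for value in values:
        # setdefault returns the existing code, or assigns the next free one.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes


def delta_encode(values):
    """Keep the first value, then store each value's difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def bits_needed(values):
    """Bits per value when each value is stored as an offset from the column minimum."""
    span = max(values) - min(values)
    return max(1, span.bit_length())


countries = ["US", "US", "US", "DE", "DE", "FR"]
timestamps = [1000, 1003, 1004, 1010]

print(run_length_encode(countries))   # [('US', 3), ('DE', 2), ('FR', 1)]
print(dictionary_encode(countries))   # (['US', 'DE', 'FR'], [0, 0, 0, 1, 1, 2])
print(delta_encode(timestamps))       # [1000, 3, 1, 6]
print(bits_needed(timestamps))        # 4
```

In the timestamp example, the raw values would each need a full integer, while the offsets from the minimum span only 0 to 10 and fit in 4 bits each.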
Why It Matters
Smaller data means less disk IO and less network transfer, which is often the dominant cost in large scans. Some encodings also let the engine operate on compressed data directly: an equality filter, for example, can compare dictionary codes without decoding them back to the original values.
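As a rough illustration of operating on compressed data, the sketch below filters a dictionary-encoded column by looking up the target value once and then comparing integer codes only. The function name `filter_equals` and the sample column are hypothetical, and a real engine would do this with vectorized comparisons rather than a Python loop.

```python
def filter_equals(dictionary, codes, target):
    """Return row positions where the column equals target, comparing codes only."""
    try:
        # One lookup in the (small) dictionary replaces a string compare per row.
        target_code = dictionary.index(target)
    except ValueError:
        return []  # target is not in the dictionary, so no row can match
    return [row for row, code in enumerate(codes) if code == target_code]


# A dictionary-encoded country column: codes index into the dictionary.
dictionary = ["US", "DE", "FR"]
codes = [0, 1, 0, 2, 1]

print(filter_equals(dictionary, codes, "DE"))  # [1, 4]
print(filter_equals(dictionary, codes, "JP"))  # []
```

The second call shows a side benefit: when the value is absent from the dictionary, the scan can be skipped entirely.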
Key Idea
Per-column encodings such as run length, dictionary, and delta exploit the shape of each column to shrink storage and IO, and sometimes let queries run directly on compressed data.