The symptom
In a distributed job, most tasks finish quickly while a few run far longer. This is data skew: one partition or key holds far more data than the others, so its task becomes a straggler that sets the job runtime.
Why it happens
- A few hot keys dominate, like one popular product in a sales table.
- Hash partitioning sends all rows for a key to one reducer, so a skewed key cannot be split.
- Null or default values pile into a single bucket.
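To make the effect concrete, here is a minimal sketch of how hash partitioning concentrates a hot key. The row counts, the key names, and the helpers `partition_of` and `partition_loads` are all illustrative assumptions, and CRC32 stands in for whatever hash function a real engine uses.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    # CRC32 is used only because it is deterministic across runs;
    # Python's built-in hash() is randomized per process for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    # Count how many rows land in each partition.
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical sales rows: one hot product and 100 ordinary ones.
rows = ["hot_product"] * 9000 + [f"product_{i}" for i in range(100)] * 10

loads = partition_loads(rows, num_partitions=8)
print("rows per partition:", loads)
print("max / mean load:", max(loads) / (sum(loads) / len(loads)))
```

Every row for `hot_product` hashes to the same partition, so that partition holds at least 9,000 of the 10,000 rows no matter how many partitions are configured.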
Fixes
- Salting appends a random suffix to the hot key so its rows spread across many tasks, then a second pass recombines the partial results.
- Isolated handling detects hot keys and processes them with a dedicated strategy, such as a broadcast join, while normal keys take the usual path.
- Adaptive execution in modern engines splits oversized partitions at runtime.
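The salting fix can be sketched as a two-phase aggregation. This is a single-process illustration under stated assumptions, not an engine's implementation: the function `salted_sum`, the fixed salt count, and the sample rows are hypothetical, and each `(key, salt)` group stands in for work that a separate task would do.

```python
import random
from collections import defaultdict


def salted_sum(rows, hot_keys, num_salts, seed=0):
    rng = random.Random(seed)  # seeded so the sketch is repeatable

    # Phase 1: aggregate on (key, salt). Hot keys get a random salt,
    # so their rows split into up to num_salts independent groups.
    partial = defaultdict(int)
    for key, amount in rows:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        partial[(key, salt)] += amount

    # Phase 2: drop the salt and recombine the partial results.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(partial), dict(totals)


# Hypothetical input: 9,000 rows for one hot key, one row each for 100 others.
rows = [("hot_product", 1)] * 9000 + [(f"product_{i}", 1) for i in range(100)]

partial, totals = salted_sum(rows, hot_keys={"hot_product"}, num_salts=8)
hot_groups = {k: v for k, v in partial.items() if k[0] == "hot_product"}
print("hot key groups after salting:", len(hot_groups))
print("largest hot group:", max(hot_groups.values()))
print("recombined total:", totals["hot_product"])
```

Salting works directly only when the operation can be recombined, as sums and counts can. The second pass is cheap because it sees at most `num_salts` partial rows per hot key.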
The goal is always to break the dominance of one key so work spreads evenly and no single task dictates the runtime.
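Isolated handling can be sketched the same way. The example below assumes a simple frequency threshold for detecting hot keys and uses a dictionary lookup as a stand-in for a broadcast join. The names `find_hot_keys`, `split_by_hotness`, and `broadcast_join`, along with the 20% threshold and the sample tables, are hypothetical choices for illustration.

```python
from collections import Counter


def find_hot_keys(keys, threshold_fraction):
    # A key is "hot" if it accounts for more than the given share of rows.
    cutoff = threshold_fraction * len(keys)
    return {k for k, c in Counter(keys).items() if c > cutoff}


def split_by_hotness(rows, hot_keys):
    # Route rows to the dedicated path or the usual path.
    hot = [r for r in rows if r[0] in hot_keys]
    normal = [r for r in rows if r[0] not in hot_keys]
    return hot, normal


def broadcast_join(rows, small_table):
    # small_table is assumed small enough to copy to every task, so the
    # rows can sit in any partition and need no shuffle by key.
    return [(k, v, small_table[k]) for k, v in rows if k in small_table]


# Hypothetical fact table and dimension table.
sales = [("hot_product", 5)] * 9000 + [(f"product_{i}", 5) for i in range(1000)]
names = {"hot_product": "Widget"}
names.update({f"product_{i}": f"Item {i}" for i in range(1000)})

hot_keys = find_hot_keys([k for k, _ in sales], threshold_fraction=0.2)
hot_rows, normal_rows = split_by_hotness(sales, hot_keys)

# Only the dimension rows for hot keys need to be broadcast.
hot_lookup = {k: names[k] for k in hot_keys}
joined_hot = broadcast_join(hot_rows, hot_lookup)
print(hot_keys, len(hot_rows), len(normal_rows), joined_hot[0])
```

The rows in `normal_rows` would continue through the ordinary hash-partitioned join, and the two results are unioned at the end.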
Key idea
Data skew makes one hot key overload a single task, and fixes like salting and isolated handling spread that work across the cluster.