What data skew is
Data skew happens when data is unevenly distributed across keys. After a shuffle, one partition may receive far more rows than the others because a few hot keys dominate. That single task runs long while the rest finish quickly, leaving the cluster idle and the job slow.
Why it is so damaging
A distributed job finishes only when its slowest task finishes. One overloaded partition becomes a straggler that stretches total runtime and can exhaust the memory of the node holding it.
Techniques to handle it
- Salting appends a random suffix to hot keys so their rows spread across many partitions, then a second pass merges the partial results.
- Isolate hot keys and process them with a separate broadcast or special path while normal keys take the usual route.
- Adaptive splitting, where the engine detects a heavy partition and divides it at runtime.
The diagnostic
Skew shows as a few tasks with huge input while most are tiny. Watch task duration spread, not just averages.
Key idea
Data skew overloads single tasks with hot keys, and salting or isolating those keys spreads the load so the slowest task no longer dominates runtime.