Data Skew Handling Deep Dive

When a few hot keys overload single tasks, and techniques like salting to spread the load.

What data skew is

Data skew happens when data is unevenly distributed across keys. After a shuffle, one partition may receive far more rows than the others because a few hot keys dominate. That single task runs long while the rest finish quickly, leaving the cluster idle and the job slow.

Why it is so damaging

A distributed job finishes only when its slowest task finishes. One overloaded partition becomes a straggler that stretches total runtime and can exhaust the memory of the node holding it.

Techniques to handle it

Salting appends a random suffix to hot keys so their rows spread across many partitions, then a second pass merges the partial results.
Isolate hot keys and process them with a separate broadcast or special path while normal keys take the usual route.
Adaptive splitting, where the engine detects a heavy partition and divides it at runtime.

The diagnostic

Skew shows as a few tasks with huge input while most are tiny. Watch task duration spread, not just averages.

Key idea

Data skew overloads single tasks with hot keys, and salting or isolating those keys spreads the load so the slowest task no longer dominates runtime.

Data Skew Handling Deep Dive

What data skew is

Why it is so damaging

Techniques to handle it

The diagnostic

Key idea

Check yourself