← Lessons

quiz vs the machine

Platinum1800

System Design

Data Skew Handling Deep Dive

When a few hot keys overload single tasks, and techniques like salting to spread the load.

6 min read · advanced · beat Platinum to climb

What data skew is

Data skew happens when data is unevenly distributed across keys. After a shuffle, one partition may receive far more rows than the others because a few hot keys dominate. That single task runs long while the rest finish quickly, leaving the cluster idle and the job slow.

Why it is so damaging

A distributed job finishes only when its slowest task finishes. One overloaded partition becomes a straggler that stretches total runtime and can exhaust the memory of the node holding it.

Techniques to handle it

  • Salting appends a random suffix to hot keys so their rows spread across many partitions, then a second pass merges the partial results.
  • Isolate hot keys and process them with a separate broadcast or special path while normal keys take the usual route.
  • Adaptive splitting, where the engine detects a heavy partition and divides it at runtime.

The diagnostic

Skew shows as a few tasks with huge input while most are tiny. Watch task duration spread, not just averages.

Key idea

Data skew overloads single tasks with hot keys, and salting or isolating those keys spreads the load so the slowest task no longer dominates runtime.

Check yourself

Answer to earn rating on the learn ladder.

1. Why does skew slow a whole job?

2. What does salting do?

3. How do you diagnose skew?