The huge output space problem
In tasks like recommendation or word embeddings, the set of possible items is enormous. Computing a probability over millions of items for every example is too expensive. Negative sampling trains on the true positive plus a small handful of randomly drawn negatives instead.
How it works
- For each observed positive pair, such as a user and an item they clicked, draw a few items they did not interact with as negatives.
- Train the model to score the positive high and the sampled negatives low.
- This replaces a costly full softmax with a cheap comparison against a few items.
Why it works
- The gradient from a few well chosen negatives approximates the gradient from the full set well enough to learn good representations.
- It turns an intractable computation into a small fixed cost per example.
Choosing negatives
- Uniform negatives are simple but may be too easy, since most random items are obviously irrelevant.
- Sampling proportional to item popularity gives more informative negatives.
- The hardest cases need deliberate selection, a topic of its own.
Key idea
Negative sampling trains on a positive plus a few sampled negatives, replacing an intractable full softmax with a cheap approximation that still learns strong representations.