The Negative Sampling

The huge output space problem

In tasks like recommendation or word embeddings, the set of possible items is enormous. Computing a probability over millions of items for every example is too expensive. Negative sampling trains on the true positive plus a small handful of randomly drawn negatives instead.

How it works

For each observed positive pair, such as a user and an item they clicked, draw a few items they did not interact with as negatives.
Train the model to score the positive high and the sampled negatives low.
This replaces a costly full softmax with a cheap comparison against a few items.

Why it works

The gradient from a few well chosen negatives approximates the gradient from the full set well enough to learn good representations.
It turns an intractable computation into a small fixed cost per example.

Choosing negatives

Uniform negatives are simple but may be too easy, since most random items are obviously irrelevant.
Sampling proportional to item popularity gives more informative negatives.
The hardest cases need deliberate selection, a topic of its own.

Key idea

Negative sampling trains on a positive plus a few sampled negatives, replacing an intractable full softmax with a cheap approximation that still learns strong representations.

The Negative Sampling

The huge output space problem

How it works

Why it works

Choosing negatives

Key idea

Check yourself