Why two towers
Scoring every item with a deep network on every request is far too slow at web scale. The two-tower model splits the work into a user tower and an item tower that produce embeddings independently, so item vectors can be precomputed offline and searched quickly at request time.
The structure
- The user tower encodes the user and context into a vector.
- The item tower encodes item features into a vector in the same space.
- Relevance is the dot product or cosine similarity of the two vectors.
Because the towers do not interact until the final similarity computation, every item vector can be computed without knowing the user and stored in an index built ahead of time.
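The structure can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the towers are single linear layers with hand-picked toy weights (USER_W, ITEM_W), and the feature vectors and item IDs are invented. Real towers are deep networks learned from data; the point is only that the item side is computed once, with no user input.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def linear_tower(features, weights):
    """Hypothetical one-layer tower: project features, then normalize."""
    return l2_normalize([dot(row, features) for row in weights])

# Toy weights, assumed for illustration (a real model learns these).
USER_W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # 3 user features -> 2 dims
ITEM_W = [[0.5, 1.0], [1.0, -0.5]]            # 2 item features -> 2 dims

# Offline: embed every item once, with no knowledge of any user.
item_features = {
    "item_a": [1.0, 0.0],
    "item_b": [0.0, 1.0],
    "item_c": [1.0, 1.0],
}
item_index = {iid: linear_tower(f, ITEM_W) for iid, f in item_features.items()}

# Online: embed the user, then score against the precomputed vectors.
user_vec = linear_tower([1.0, 0.0, 0.0], USER_W)
scores = {iid: dot(user_vec, vec) for iid, vec in item_index.items()}
best_item = max(scores, key=scores.get)
```

Because both towers normalize their outputs, the dot product here is also the cosine similarity, which is one common way to reconcile the two scoring options.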
Retrieval at serving time
- Embed the user once per request.
- Run approximate nearest-neighbor (ANN) search over the item index.
- Return the top few hundred candidates in milliseconds.
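The serving steps above can be sketched as follows. Note the swap: this sketch uses exact brute-force search, not approximate search, because it scans every item. It is a stand-in that returns what an ANN index approximates, and it would be replaced by an ANN library in production. The function name retrieve_top_k and the toy index are assumptions for illustration.

```python
import heapq

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def retrieve_top_k(user_vec, item_index, k):
    """Return the k item IDs with the highest dot-product score.

    Exact search over all items: O(number of items) per request.
    An ANN index trades a little recall for much lower latency.
    """
    scored = ((dot(user_vec, vec), item_id) for item_id, vec in item_index.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Toy precomputed item index (hypothetical vectors).
toy_index = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.7, 0.7],
}

# The user is embedded once per request, then the index is searched.
candidates = retrieve_top_k([1.0, 0.0], toy_index, k=2)
```

The candidates returned at this stage are typically passed to a heavier ranking model, which can afford richer features because it only scores a few hundred items.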
Training tricks
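- Use in-batch negatives: for each user, treat the positive items of the other examples in the batch as negatives, which is cheap because their embeddings are already computed.
- Apply a logQ correction: subtract the log of each item's sampling probability from its logit, which offsets the bias against popular items that appear as negatives too often.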
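Both tricks fit in one loss function. The sketch below is a plain-Python illustration, not a training-ready implementation: the function name in_batch_softmax_loss is hypothetical, sampling_probs is assumed to be supplied by the caller (in practice it is estimated from item frequencies), and the correction is applied to every logit, including the positive. Some implementations correct only the negatives.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def in_batch_softmax_loss(user_vecs, item_vecs, sampling_probs):
    """Softmax cross-entropy with in-batch negatives and logQ correction.

    Row i of user_vecs is paired with row i of item_vecs (the positive).
    Every other item in the batch serves as a negative for user i.
    sampling_probs[j] is the assumed probability that item j appears in
    a batch; subtracting its log reduces the penalty on popular items.
    """
    batch = len(user_vecs)
    total = 0.0
    for i in range(batch):
        logits = [
            dot(user_vecs[i], item_vecs[j]) - math.log(sampling_probs[j])
            for j in range(batch)
        ]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the positive
    return total / batch

users = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]
uniform = [0.5, 0.5]

aligned_loss = in_batch_softmax_loss(users, items, uniform)
# Swapping the items pairs each user with the wrong positive.
swapped_loss = in_batch_softmax_loss(users, items[::-1], uniform)
```

With uniform sampling probabilities the correction shifts every logit by the same constant and cancels out of the softmax, so it only changes the loss when some items are sampled more often than others.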
Key idea
Two-tower models encode users and items independently into a shared space, so item vectors can be indexed ahead of time and retrieved by fast nearest-neighbor search. They are trained efficiently with in-batch negatives.