The architecture
The two-tower model has a user tower and an item tower, each a neural network. The user tower maps user features to an embedding; the item tower maps item features to an embedding in the same vector space. The score for a user-item pair is the dot product of the two embeddings.
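As a rough illustration of the two towers and the dot-product score, here is a minimal NumPy sketch. The layer sizes, feature dimensions, random untrained weights, and the names init_tower and tower are all assumptions made for the example, not part of any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tower(in_dim, hidden_dim, out_dim):
    """Random weights for a one-hidden-layer MLP (illustrative, untrained)."""
    return {
        "w1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def tower(params, features):
    """Map a batch of feature vectors to a batch of embeddings."""
    hidden = np.maximum(features @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]

# Hypothetical sizes: 8 user features, 5 item features, 4-dim shared space.
user_tower = init_tower(in_dim=8, hidden_dim=16, out_dim=4)
item_tower = init_tower(in_dim=5, hidden_dim=16, out_dim=4)

user_features = rng.normal(size=(3, 8))  # 3 users
item_features = rng.normal(size=(6, 5))  # 6 items

user_emb = tower(user_tower, user_features)  # shape (3, 4)
item_emb = tower(item_tower, item_features)  # shape (6, 4)

# Score every user against every item with a dot product.
scores = user_emb @ item_emb.T               # shape (3, 6)
```

The two towers can take inputs of different sizes; only the output dimension has to match so the dot product is defined.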
Why the towers stay separate
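Because the item tower sees only item features and the towers interact only through the final dot product, an item's embedding does not depend on which user is being served. That means you can precompute every item embedding offline, refreshing them when the item tower or the catalog changes, and store them in a nearest-neighbor index. At serving time you compute only the user embedding, then run a fast nearest-neighbor search over the index.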
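The serving path might look roughly like the sketch below. It assumes the item embeddings already exist (random vectors stand in for the item tower's output here) and uses an exact brute-force search for clarity; at large scale an approximate nearest-neighbor index would normally take its place. The sizes and the retrieve helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for item-tower output, computed offline for the whole catalog.
# Hypothetical sizes: 1000 items, 32-dimensional embeddings.
item_index = rng.normal(size=(1000, 32))

def retrieve(user_embedding, item_index, k):
    """Return the ids of the k items with the highest dot-product score."""
    scores = item_index @ user_embedding         # one score per item
    top_k = np.argpartition(-scores, k - 1)[:k]  # top k ids, unordered
    return top_k[np.argsort(-scores[top_k])]     # order best first

# At serving time only the user embedding is computed (stand-in vector here).
user_embedding = rng.normal(size=32)
candidates = retrieve(user_embedding, item_index, k=10)
```

Nothing in retrieve touches item features, which is the point: the per-request work is one user-tower pass plus the search.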
Training
- Build batches of positive user-item pairs from interactions.
- Use in-batch negatives: for each user, the items paired with the other users in the batch act as negatives, which is cheap (their embeddings are already computed) and effective.
- Optimize a contrastive or softmax loss so the true item scores higher than the negatives.
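The steps above can be sketched as a softmax loss over in-batch negatives. This is only the loss computation in NumPy, with random vectors standing in for tower outputs; an actual training loop would compute it in an autodiff framework and backpropagate into both towers. The function name in_batch_softmax_loss and the batch sizes are assumptions for the example.

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb):
    """Mean cross-entropy where row i's positive is item i and the
    other items in the batch serve as negatives."""
    logits = user_emb @ item_emb.T                       # (B, B) score matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # positives on the diagonal

rng = np.random.default_rng(0)
batch_items = rng.normal(size=(4, 8))  # 4 positive pairs, 8-dim embeddings
batch_items /= np.linalg.norm(batch_items, axis=1, keepdims=True)  # unit norm

# All-zero user embeddings score every item equally, so the loss is log(4).
loss_uninformative = in_batch_softmax_loss(np.zeros((4, 8)), batch_items)

# User embeddings that point at their own item give a lower loss.
loss_aligned = in_batch_softmax_loss(5.0 * batch_items, batch_items)
```

Each row of the score matrix is treated as a classification problem over the batch, with the diagonal entry as the correct class.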
Where it fits
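The two-tower model is the workhorse of candidate generation. Its separability is exactly what makes billion-scale retrieval possible. The cost is that user and item features meet only at the final dot product, so the model cannot capture rich user-item cross features; a heavier ranker handles those later, on the much smaller candidate set.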
Key idea
The two-tower model encodes users and items separately into one shared space and scores by dot product, so item embeddings can be precomputed for fast nearest-neighbor retrieval.