Two complementary needs
A good recommender must both memorize known good combinations and generalize to unseen ones. The Wide & Deep model trains a wide linear component and a deep neural component jointly, so each compensates for the other's weakness.
The wide part
The wide side is a linear model over raw and crossed features. A cross feature, such as installed app paired with impression app, is active only when both conditions hold, which lets the model memorize specific co-occurrences seen in the logs.
- Great at remembering exact rules.
- Cannot generalize to feature pairs it never saw.
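To make the memorization behaviour concrete, here is a minimal sketch of a cross feature feeding a linear scorer. The feature naming scheme, the app names, and the weight values are hypothetical, chosen only for illustration.

```python
def cross_features(installed_apps, impression_app):
    """Return one binary cross feature per (installed app, impression app) pair."""
    return {f"installed={a}&impression={impression_app}" for a in installed_apps}

# Hypothetical weights a wide model might have learned from logs.
wide_weights = {
    "installed=netflix&impression=pandora": 1.2,
    "installed=maps&impression=pandora": -0.3,
}

def wide_logit(active_features, weights, bias=0.0):
    # Linear model: sum the weights of the features that fire.
    # A cross never seen in training has no weight and contributes 0.
    return bias + sum(weights.get(f, 0.0) for f in active_features)

seen = cross_features(["netflix"], "pandora")
unseen = cross_features(["netflix"], "spotify")

seen_logit = wide_logit(seen, wide_weights)      # uses the memorized weight
unseen_logit = wide_logit(unseen, wide_weights)  # no weight exists for this pair
```

The unseen pair falls back to the bias alone, which is the generalization gap noted in the second bullet.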
The deep part
The deep side embeds sparse features into dense vectors and passes them through a feed-forward network.
- Learns smooth generalizations across similar items.
- Can over-generalize and recommend irrelevant items when interaction data is sparse.
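The deep side can be sketched in the same spirit: look up an embedding per sparse ID, concatenate, and run a small feed-forward pass. The sizes, the vocabulary, and the random initial values below are assumptions for illustration; in a real model the embeddings and weights are learned.

```python
import random

random.seed(0)
EMBED_DIM = 4  # toy embedding size (assumed)
HIDDEN = 3     # toy hidden-layer width (assumed)

vocab = ["netflix", "pandora", "spotify"]  # hypothetical app IDs
# One dense vector per sparse ID (random here; learned in practice).
embeddings = {
    app: [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for app in vocab
}

# One hidden layer with ReLU, followed by a linear output unit.
W1 = [[random.uniform(-0.5, 0.5) for _ in range(2 * EMBED_DIM)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w_out = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN)]

def deep_logit(installed_app, impression_app):
    # Concatenate the two embeddings into one dense input vector.
    x = embeddings[installed_app] + embeddings[impression_app]
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(w_out, hidden))

# Any pair of known IDs gets a score, even if the pair never co-occurred in logs.
score = deep_logit("netflix", "spotify")
```

Because the score depends only on the individual embeddings, any pair of known IDs can be scored. This is the source of both the generalization and the risk of over-generalization.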
Joint training
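Both parts feed a shared output: their logits are summed and passed through a single logistic loss, so one backward pass updates both sides at once. In the original paper the wide side is optimized with FTRL and L1 regularization, which keeps its weights sparse, while the deep side uses AdaGrad.
- The wide model patches the deep model when it over-generalizes.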
- The deep model fills gaps the wide model cannot reach.
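A minimal sketch of one joint gradient step, assuming the two sides' logits are summed and fed to a single logistic loss. For brevity both sides use plain SGD here, which is a simplification rather than the separate optimizers described above. All names, sizes, and initial values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Wide side: one weight per cross feature (hypothetical feature name).
wide_w = {"installed=netflix&impression=pandora": 0.0}
# Deep side: a 2-d embedding per app plus an output weight vector (toy sizes).
emb = {"netflix": [0.1, -0.2], "pandora": [0.3, 0.1]}
v = [0.5, -0.5]

def forward(installed, impression):
    cross = f"installed={installed}&impression={impression}"
    pre = [a + b for a, b in zip(emb[installed], emb[impression])]
    hidden = [max(0.0, p) for p in pre]  # ReLU
    # Shared output: wide logit plus deep logit, then one sigmoid.
    logit = wide_w.get(cross, 0.0) + sum(vk * hk for vk, hk in zip(v, hidden))
    return cross, pre, hidden, sigmoid(logit)

def train_step(installed, impression, label, lr=0.5):
    cross, pre, hidden, p = forward(installed, impression)
    g = p - label  # d(log loss)/d(logit), shared by both sides
    # Wide update: the cross feature is active, so its gradient is g.
    wide_w[cross] = wide_w.get(cross, 0.0) - lr * g
    # Deep update: backpropagate the same g through the output weights and ReLU.
    for k in range(len(v)):
        grad_e = g * v[k] * (1.0 if pre[k] > 0 else 0.0)
        grad_v = g * hidden[k]
        emb[installed][k] -= lr * grad_e
        emb[impression][k] -= lr * grad_e
        v[k] -= lr * grad_v
    return -math.log(p if label == 1 else 1.0 - p)

# Repeatedly train on one positive example and record the loss.
losses = [train_step("netflix", "pandora", 1) for _ in range(20)]
```

The single error signal `g` drives both updates, so each side only has to learn what the other has not already explained. That is the difference between joint training and averaging two separately trained models.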
Key idea
Wide & Deep jointly trains a linear cross-feature memorizer with a deep generalizer, so the system both recalls exact patterns and extends to unseen combinations.