One matrix, two jobs
Weight tying uses a single matrix for both the input embedding and the output unembedding: the embedding looks up rows of the matrix, and the unembedding multiplies by its transpose. The vector that represents a token going in is reused to score that token coming out.
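As a minimal sketch of the idea, the toy example below keeps one matrix and uses it in both directions. The vocabulary, the numbers in E, and the helper names embed and unembed are all made up for illustration, and the "model body" between the two steps is assumed to be the identity.

```python
# Toy tied embedding/unembedding in plain Python (all values are illustrative).
VOCAB = ["the", "cat", "kitten", "mat"]

# One shared matrix E with one row per token: shape (vocab_size, d_model).
E = [
    [1.0, 0.0, 0.0],  # "the"
    [0.0, 1.0, 0.0],  # "cat"
    [0.0, 0.9, 0.1],  # "kitten", deliberately placed close to "cat"
    [0.0, 0.0, 1.0],  # "mat"
]

def embed(token_id):
    """Input side: look up the token's row of E."""
    return E[token_id]

def unembed(hidden):
    """Output side: multiply by E transposed, i.e. dot the hidden
    state with every row of E to get one logit per token."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]

hidden = embed(1)          # pretend the model body is the identity
logits = unembed(hidden)   # one score per vocabulary entry
best = max(range(len(logits)), key=logits.__getitem__)
print(VOCAB[best], logits)
```

Because the same rows are used for reading and scoring, the token whose vector is closest to the hidden state gets the highest logit, and its near neighbor "kitten" gets the second highest.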
Why it is reasonable
The embedding row for a token and the unembedding column for the same token both describe its position in vector space. Tying them asserts that the same notion of similarity should govern reading a token and predicting it.
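One way to write this down, using notation that is assumed here rather than taken from the text (E for the shared matrix, V for the vocabulary size, d for the hidden size, h for the final hidden state, e_j for row j of E):

```latex
E \in \mathbb{R}^{V \times d}, \qquad
x_t = e_t \quad (\text{row } t \text{ of } E), \qquad
z = E\,h \in \mathbb{R}^{V}, \qquad
z_j = e_j^{\top} h, \qquad
p(j \mid h) = \frac{\exp(z_j)}{\sum_{k=1}^{V} \exp(z_k)}
```

Under this formulation the logit for token j is a dot product with the same vector e_j that represents token j on the input side, so tokens with nearby embeddings receive similar scores.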
The benefits
- It removes a large parameter block: the vocabulary projection holds vocabulary size times hidden size parameters, which can be a large share of a small model.
- The shared geometry tends to improve generalization and lower perplexity.
- Gradients from both the input and output paths refine the same representation.
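The parameter saving in the first point can be checked with a few lines of arithmetic. The sizes below (a 50,000-token vocabulary and a hidden size of 512) are hypothetical, and the helper vocab_params is introduced only for this illustration; biases are ignored.

```python
def vocab_params(vocab_size, d_model, tied):
    """Parameters spent on the embedding plus the unembedding (no biases)."""
    one_matrix = vocab_size * d_model
    return one_matrix if tied else 2 * one_matrix

# Hypothetical small model: 50,000-token vocabulary, hidden size 512.
vocab_size, d_model = 50_000, 512
untied = vocab_params(vocab_size, d_model, tied=False)
tied = vocab_params(vocab_size, d_model, tied=True)
saved = untied - tied
print(f"untied: {untied:,}  tied: {tied:,}  saved: {saved:,}")
```

With these assumed sizes, tying saves 25,600,000 parameters, which is substantial for a model whose remaining layers may hold only a few tens of millions.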
The cautions
Tying forces the input and output spaces to be consistent, which is usually helpful but can be limiting if a model needs them to differ. Many large models still tie weights, while some very large ones keep them separate: there the embedding matrices are a small share of the total parameter count, so untying costs relatively little and gives the model extra capacity.
Key idea
Weight tying shares one transposed matrix between the embedding and unembedding, saving parameters and aligning the geometry of reading and predicting a token, which often improves generalization.