Weight Tying

One matrix, two jobs

Weight tying uses a single matrix for both the input embedding and the output unembedding; the output side simply applies its transpose. The vector that represents a token on the way in is reused to score that token on the way out.
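A minimal sketch of the idea, using plain Python lists so it runs without any libraries. The matrix E, its toy sizes, and the helper names embed and unembed are all assumptions made up for illustration; a real model would use a tensor library and learned values.

```python
# Hypothetical toy sizes: 4 tokens in the vocabulary, 3-dimensional vectors.
# E is the single shared matrix: one row per token.
E = [
    [1.0, 0.0, 0.0],   # token 0
    [0.0, 1.0, 0.0],   # token 1
    [0.0, 0.0, 1.0],   # token 2
    [1.0, 1.0, 0.0],   # token 3
]

def embed(token_id):
    """Input side: look up the row of E for this token."""
    return E[token_id]

def unembed(hidden):
    """Output side: score every token against the same rows of E.

    This is hidden @ E^T written out as one dot product per token.
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]

# Stand-in for the final hidden state a model would produce.
hidden = [0.5, 2.0, -1.0]
logits = unembed(hidden)
print(logits)
```

Both functions read from the same E, so there is no second vocabulary-sized matrix to store or train.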

Why it is reasonable

The embedding row for a token and the unembedding column for the same token both describe its position in vector space. Tying them asserts that the same notion of similarity should govern reading a token and predicting it.
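One way to see the shared notion of similarity is to skip the model body entirely and feed a token's embedding straight into the tied output layer. That shortcut, the toy matrix E, and the helper names dot and score are assumptions for illustration only; a real model transforms the embedding through many layers first.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 3-token vocabulary with 2-dimensional vectors.
# Tokens 0 and 1 point in similar directions; token 2 does not.
E = [
    [2.0, 0.0],    # token 0
    [1.5, 0.5],    # token 1
    [-1.0, 1.0],   # token 2
]

def score(in_token, out_token):
    """Logit for out_token when the hidden state is in_token's embedding.

    With tied weights this is just the dot product of two rows of E.
    """
    return dot(E[in_token], E[out_token])

print(score(0, 1), score(1, 0), score(0, 2))
```

Under this simplification the score is symmetric in the two tokens, and tokens with nearby embeddings score each other highly. With untied weights the input and output matrices could disagree about which tokens are close.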

The benefits

It removes a large parameter block: the output projection holds one vector per vocabulary entry, and in a small model that matrix can account for a large share of all parameters.
The shared geometry tends to improve generalization and lower perplexity.
Gradients from both the input and output paths refine the same representation.
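The parameter saving in the first point is simple arithmetic, sketched below. The vocabulary size and model dimension are hypothetical round numbers, not figures from any particular model, and projection_params is a helper name invented here.

```python
def projection_params(vocab_size, d_model, tied):
    """Parameters spent on the embedding and unembedding matrices.

    Each matrix has vocab_size * d_model entries. Tying keeps one copy
    instead of two. Biases are ignored in this sketch.
    """
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix

# Hypothetical sizes chosen only to make the arithmetic concrete.
vocab_size, d_model = 50_000, 512

untied = projection_params(vocab_size, d_model, tied=False)
tied = projection_params(vocab_size, d_model, tied=True)
saved = untied - tied
print(untied, tied, saved)
```

For these sizes, tying saves 25.6 million parameters, which is exactly one vocabulary-sized matrix.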

The cautions

Tying forces the input and output spaces to be consistent, which is usually helpful but can be limiting if a model needs them to differ. Many large models still tie weights, while some very large ones keep the two matrices separate: at that scale the extra matrix is a small fraction of the total parameter count, so the added flexibility costs little.

Key idea

Weight tying shares one transposed matrix between the embedding and unembedding, saving parameters and aligning the geometry of reading and predicting a token, which often improves generalization.
