The problem with full fine tuning
Fine tuning every weight of a large model needs huge memory for the weights, their gradients, and optimizer state. LoRA, low rank adaptation, avoids this by freezing the original weights and training only small add on matrices.
Low rank updates
LoRA assumes the change a task needs is low rank, meaning it lives in a small subspace. For a weight matrix it learns two thin matrices whose product forms the update:
- Keep the original weights frozen.
- Add a learned update equal to the product of two small matrices.
- Only those small matrices receive gradients.
Because the two matrices are tiny compared to the full weight, the number of trainable parameters drops by orders of magnitude.
Practical benefits
- Far less memory, so fine tuning fits on smaller GPUs.
- Many task specific adapters can share one frozen base model.
- Adapters can be merged into the weights at inference for no extra cost.
Key idea
LoRA freezes the base model and learns a low rank update from two small matrices, cutting trainable parameters dramatically while keeping quality.