The motivation
Full fine-tuning of a large model updates all of its billions of weights. That is expensive, and it leaves a full-size copy of the model for every task. LoRA, short for low-rank adaptation, makes this much cheaper by training only small add-on matrices.
The low-rank trick
Instead of changing an original weight matrix of size d × k, LoRA learns an update expressed as the product of two thin matrices. These are B (d × r) and A (r × k), where the rank r is much smaller than d and k.
- The big pretrained weights stay frozen
- A pair of low rank matrices learns the change
- Their product is added to the frozen weights' output in the forward pass, and it can be merged into the weights for inference at no extra cost
- Only these small matrices are trained and saved
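The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's API. It assumes the conventions of the original LoRA paper: A starts random, B starts at zero so the update is initially zero, and the update is scaled by alpha / r. The class name `LoRALinear`, the helper functions, and the 0.01 standard deviation for A are illustrative choices, and training itself is omitted.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))  # columns of Y
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


class LoRALinear:
    """Frozen d x k weight W0 plus a trainable rank-r update B @ A (illustrative)."""

    def __init__(self, W0, r, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d, k = len(W0), len(W0[0])
        self.W0 = W0  # frozen: never updated
        # r x k, random Gaussian init (std 0.01 is an arbitrary choice here)
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d)]  # d x r, zero init, so B @ A starts at 0
        self.scale = alpha / r                  # update is scaled by alpha / r

    def forward(self, x):
        # h = W0 x + scale * B (A x); the d x k product B @ A is never formed here
        base = matvec(self.W0, x)
        update = matvec(self.B, matvec(self.A, x))
        return [h + self.scale * u for h, u in zip(base, update)]

    def merged_weight(self):
        # For inference, fold the update in once: W = W0 + scale * B @ A
        delta = matmul(self.B, self.A)
        return [[w + self.scale * dw for w, dw in zip(w_row, d_row)]
                for w_row, d_row in zip(self.W0, delta)]


W0 = [[1.0, 2.0, 0.0],
      [0.0, 1.0, 3.0]]  # d = 2, k = 3
layer = LoRALinear(W0, r=1, alpha=2.0)
x = [1.0, 1.0, 1.0]
print(layer.forward(x))  # [3.0, 4.0]: B is zero, so only W0 x contributes

layer.B = [[0.5], [-1.0]]  # pretend training has moved B away from zero
print(layer.forward(x))
print(matvec(layer.merged_weight(), x))  # matches the line above up to rounding
```

The last two lines show why merging is free at inference. Computing W0 x plus the separate low-rank path gives the same result as one multiplication by the merged matrix.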
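Because the update has low rank, it needs only a tiny fraction of the parameters. A full update to a d × k weight matrix has d·k entries, while its rank-r factors hold only r(d + k). Across a whole model, the trainable share is often well under one percent.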
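The per-matrix arithmetic is easy to check. The 4096 × 4096 layer size and rank r = 8 below are hypothetical values chosen for illustration, not figures from the text. The share for a whole model also depends on which layers actually receive adapters.

```python
def full_update_params(d, k):
    # Updating every entry of a d x k matrix trains d * k numbers
    return d * k


def lora_update_params(d, k, r):
    # B is d x r and A is r x k, so the factors hold r * (d + k) numbers
    return r * (d + k)


d, k, r = 4096, 4096, 8  # hypothetical layer size and rank
full = full_update_params(d, k)
lora = lora_update_params(d, k, r)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 65536 0.39%
```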
Why it is popular
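LoRA adapters are small files that you can swap per task while sharing one base model. This makes LoRA a form of parameter-efficient fine-tuning (PEFT). A common variant, QLoRA, combines LoRA with quantization. It stores the frozen base model in low precision (4-bit in the original work) and trains LoRA adapters on top, so even very large models can be fine-tuned on a single GPU.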
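Swapping adapters amounts to keeping one frozen base matrix plus a small (B, A) pair per task. When a task is served, the chosen pair's product is added to the base. The toy sketch below makes several simplifications: the task names and values are made up, it uses plain Python lists, and it omits the alpha / r scaling.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(xr, yr)] for xr, yr in zip(X, Y)]


# One shared, frozen base weight (2 x 2 here to keep the numbers readable).
base_W = [[1.0, 0.0],
          [0.0, 1.0]]

# Hypothetical per-task adapters: each is a rank-1 pair (B: 2 x 1, A: 1 x 2).
adapters = {
    "summarize": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "translate": ([[0.0], [1.0]], [[3.0, 0.0]]),
}


def weight_for(task):
    # Build the task-specific weight W0 + B @ A without modifying base_W
    B, A = adapters[task]
    return add(base_W, matmul(B, A))


print(weight_for("summarize"))  # [[1.0, 2.0], [0.0, 1.0]]
print(weight_for("translate"))  # [[1.0, 0.0], [3.0, 1.0]]
```

Only the small pairs differ between tasks. The base weights are loaded once and never change.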
Key idea
LoRA freezes the base model and trains tiny low-rank matrices as the weight update, enabling cheap, swappable fine-tuning of large models.