Quantize then adapt
QLoRA combines two ideas: store the frozen base model in 4-bit quantized form, then train LoRA adapters on top. The compressed base takes roughly a quarter of the memory of its 16-bit version, and only the small adapters carry gradients and optimizer state, so even very large models can be fine-tuned on a single GPU.
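To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The 65-billion-parameter model size, the 4096-wide layer, and the rank of 16 are illustrative assumptions, and the helper names are hypothetical. The estimate counts weight storage only and ignores quantization constants, activations, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


num_params = 65e9  # assumed model size, for illustration only
fp16_gb = weight_memory_gb(num_params, 16)  # base stored in 16-bit
nf4_gb = weight_memory_gb(num_params, 4)    # base stored in 4-bit

# One assumed 4096 x 4096 projection with a rank-16 adapter.
adapter_params = lora_param_count(4096, 4096, 16)
full_params = 4096 * 4096
trainable_fraction = adapter_params / full_params

print(f"16-bit base: {fp16_gb:.1f} GB, 4-bit base: {nf4_gb:.1f} GB")
print(f"adapter trains {trainable_fraction:.2%} of that layer's parameters")
```

Under these assumptions the 4-bit base needs a quarter of the 16-bit storage, and the adapter adds well under one percent of the layer's parameter count.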
Key techniques
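QLoRA introduced three techniques that keep 4-bit fine-tuning accurate and within memory limits:
- NF4 (4-bit NormalFloat), a 4-bit data type whose quantization levels are spaced to match the roughly normal (bell-curve) distribution of neural network weights.
- Double quantization, which also quantizes the per-block scaling constants to save more memory.
- Paged optimizers, which move optimizer state to host (CPU) memory during memory spikes so that training does not run out of GPU memory.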
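The first two techniques can be sketched in a few lines of plain Python. This is a simplified illustration, not the real implementation. The codebook below is built from evenly spaced normal quantiles and is only a stand-in for the actual NF4 table. The block size of 64, the 8-bit second-level constants, and the second-level block size of 256 are taken as assumptions from the description in the QLoRA paper. All function names are hypothetical.

```python
import random
from statistics import NormalDist

# Illustrative 16-level codebook: quantiles of a standard normal, rescaled to [-1, 1].
# A simplified stand-in, NOT the exact NF4 table.
_quantiles = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
_largest = max(abs(q) for q in _quantiles)
CODEBOOK = [q / _largest for q in _quantiles]


def quantize_block(block):
    """Return (absmax scale, list of 4-bit codebook indices) for one block of weights."""
    scale = max(abs(w) for w in block) or 1.0  # fall back to 1.0 for an all-zero block
    indices = [
        min(range(16), key=lambda k: abs(CODEBOOK[k] - w / scale))
        for w in block
    ]
    return scale, indices


def dequantize_block(scale, indices):
    """Look up each index in the codebook and rescale by the block's absmax."""
    return [CODEBOOK[k] * scale for k in indices]


def scale_overhead_bits(block_size=64, double_quant=False, second_block=256):
    """Bits per weight spent on scaling constants (assumed sizes, see lead-in)."""
    if not double_quant:
        return 32 / block_size  # one 32-bit scale per block
    # 8-bit scale per block, plus one 32-bit constant per group of second_block scales
    return 8 / block_size + 32 / (block_size * second_block)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # one block of toy weights
scale, idx = quantize_block(weights)
restored = dequantize_block(scale, idx)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
saved = scale_overhead_bits() - scale_overhead_bits(double_quant=True)

print(f"max reconstruction error relative to scale: {max_err / scale:.3f}")
print(f"double quantization saves about {saved:.3f} bits per weight")
```

Because the levels follow normal quantiles, they are densest near zero, where most weights fall. The overhead function shows why quantizing the scales matters: under the assumed sizes, the per-weight cost of the constants drops from 0.5 bits to about 0.127 bits.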
How gradients flow
The frozen 4-bit weights are dequantized on the fly to a higher-precision compute type during the forward and backward passes. Gradients pass through them but update only the LoRA adapters, never the base. The heavy weights therefore stay compact in storage while the small adapters learn.
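The following toy sketch illustrates this flow for a single linear layer, with gradients written out by hand. It is a minimal illustration under several assumptions: the frozen weights use plain integer codes with one scale as a stand-in for NF4, the LoRA scaling factor is omitted, and the dimensions, learning rate, and data are made up. With only one layer there is nothing earlier to backpropagate into, so the dequantized weights appear only in the forward pass here.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dequantize(codes, scale):
    """Stand-in for NF4 dequantization: integer codes times one scale."""
    return [[c * scale for c in row] for row in codes]


def forward(codes, scale, A, B, x):
    W = dequantize(codes, scale)  # rebuilt on every call, never kept in high precision
    h = matvec(A, x)              # rank-sized hidden vector
    y = [base + delta for base, delta in zip(matvec(W, x), matvec(B, h))]
    return y, h


def loss_of(y, target):
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))


random.seed(0)
d_in, d_out, rank, lr = 4, 3, 2, 0.05

# Frozen base: stored only as 4-bit integer codes (-8..7) plus a scale.
codes = [[random.randint(-8, 7) for _ in range(d_in)] for _ in range(d_out)]
scale = 0.05
codes_before = [row[:] for row in codes]

# LoRA adapters: A starts small and random, B starts at zero.
A = [[random.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]

x = [1.0, -0.5, 0.25, 0.5]
target = [0.5, -1.0, 0.0]

y, _ = forward(codes, scale, A, B, x)
loss_start = loss_of(y, target)

for _ in range(200):
    y, h = forward(codes, scale, A, B, x)
    err = [yi - ti for yi, ti in zip(y, target)]  # dL/dy
    # dL/dB = err outer h, dL/dA = (B^T err) outer x. No gradient is formed for codes.
    Bt_err = [sum(B[i][k] * err[i] for i in range(d_out)) for k in range(rank)]
    grad_B = [[err[i] * h[k] for k in range(rank)] for i in range(d_out)]
    grad_A = [[Bt_err[k] * x[j] for j in range(d_in)] for k in range(rank)]
    B = [[B[i][k] - lr * grad_B[i][k] for k in range(rank)] for i in range(d_out)]
    A = [[A[k][j] - lr * grad_A[k][j] for j in range(d_in)] for k in range(rank)]

y, _ = forward(codes, scale, A, B, x)
loss_end = loss_of(y, target)
print(f"loss before: {loss_start:.4f}, after: {loss_end:.4f}")
print("base codes unchanged:", codes == codes_before)
```

The loop never computes or applies a gradient for `codes`, which mirrors the frozen base, while `A` and `B` absorb all of the learning.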
Why it matters
QLoRA made it possible to fine-tune models with tens of billions of parameters on a single GPU instead of a multi-GPU cluster, broadening who can adapt large models.
Key idea
QLoRA keeps the base model in 4-bit NF4 and trains only LoRA adapters, letting huge models be fine-tuned on a single GPU.