QLoRA

Combining 4-bit quantization with LoRA to fine-tune very large models on a single GPU.

Quantize then adapt

QLoRA combines two ideas: store the frozen base model in 4-bit quantized form, then train LoRA adapters on top of it. The base weights take roughly a quarter of the memory they would need at 16-bit precision, and only the small adapters carry gradients and optimizer state, so even very large models can be fine-tuned on one GPU.
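
The memory argument above can be checked with some back-of-the-envelope arithmetic. The sketch below is only an estimate: the 65-billion-parameter model size, the 4096 x 4096 layer shape, and the rank of 16 are illustrative assumptions, and the 4-bit figure ignores the small overhead of quantization constants.

```python
def weight_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_params(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Illustrative model size (an assumption, not a figure from the text).
n_params = 65_000_000_000

base_16bit = weight_gb(n_params, 16)  # frozen base stored in 16-bit
base_4bit = weight_gb(n_params, 4)    # frozen base stored in 4-bit

# Illustrative layer shape and rank (also assumptions).
full_layer = 4096 * 4096
adapter = lora_params(4096, 4096, rank=16)

print(f"16-bit base: {base_16bit:.1f} GB")
print(f"4-bit base:  {base_4bit:.1f} GB")
print(f"adapter / full layer: {adapter / full_layer:.4%}")
```

The adapter is under one percent of the layer it sits on, which is why gradients and optimizer state stay small even when the base model is large.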

Key techniques

QLoRA introduced three techniques that preserve accuracy and keep memory in check at 4 bits:

NF4 (4-bit NormalFloat), a 4-bit data type whose levels are spaced to match the bell-curve distribution of neural network weights.
Double quantization, which also quantizes the per-block scaling constants to save more memory.
Paged optimizers, which spill optimizer state to host (CPU) memory during memory spikes to avoid out-of-memory errors.
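
The first two techniques can be sketched in a few lines. This is a simplified illustration, not the real NF4 implementation: the codebook here is built from evenly spaced quantiles of a standard normal (the actual NF4 table is constructed differently and includes an exact zero), and the helper names are hypothetical. The overhead arithmetic assumes the block sizes reported for QLoRA: 64 weights per block, with 8-bit scales grouped 256 at a time under a 32-bit second-level constant.

```python
import random
from statistics import NormalDist


def normal_codebook(levels=16):
    """NF4-like codebook: normal quantiles rescaled to [-1, 1] (simplified)."""
    nd = NormalDist()
    # Offset by 0.5 so no probability is 0 or 1, where inv_cdf is undefined.
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]


def quantize_block(block, codebook):
    """Absmax-scale one block, then snap each weight to the nearest level."""
    scale = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / scale))
        for w in block
    ]
    return codes, scale


def dequantize_block(codes, scale, codebook):
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [codebook[c] * scale for c in codes]


def scale_overhead_bits(block=64, scale_bits=32, double_quant=False, scale_block=256):
    """Extra bits per weight spent on scaling constants."""
    if not double_quant:
        return scale_bits / block
    # 8-bit scales, plus one scale_bits constant per group of scale_block scales.
    return 8 / block + scale_bits / (block * scale_block)


random.seed(0)
codebook = normal_codebook()
block = [random.gauss(0.0, 0.02) for _ in range(64)]  # toy bell-curve weights
codes, scale = quantize_block(block, codebook)
restored = dequantize_block(codes, scale, codebook)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print("max round-trip error:", max_err)
print("overhead, plain: ", scale_overhead_bits())
print("overhead, double:", scale_overhead_bits(double_quant=True))
```

Because the levels are denser near zero, where most weights sit, the round-trip error for typical weights is smaller than it would be with 16 evenly spaced levels.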

How gradients flow

The frozen 4-bit weights are dequantized on the fly during the forward and backward passes. Gradients are computed through them, but they update only the LoRA adapters, never the base. So the heavy weights stay compact while the tiny adapters learn.
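
A toy training step makes this flow concrete. The sketch below is a pure-Python illustration under several assumptions: `dequantize` is a hypothetical stand-in that maps 4-bit codes to evenly spaced values (not real NF4), the LoRA scaling factor is taken as 1, the loss is a squared error on a single input, and all shapes and numbers are made up for the example.

```python
def dequantize(codes, scale):
    """Toy stand-in for 4-bit dequantization: codes 0..15 -> floats."""
    return [[(c - 8) / 8 * scale for c in row] for row in codes]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]


def lora_step(W_codes, scale, A, B, x, target, lr):
    """One SGD step on y = W x + B (A x); only A and B are updated."""
    W = dequantize(W_codes, scale)       # rebuilt on the fly, never trained
    u = matvec(A, x)                     # low-rank projection of the input
    y = [wx + bu for wx, bu in zip(matvec(W, x), matvec(B, u))]
    err = [yi - ti for yi, ti in zip(y, target)]  # dL/dy for L = 0.5*|y - t|^2
    loss = 0.5 * sum(e * e for e in err)

    Bt_err = matvec(transpose(B), err)
    grad_B = outer(err, u)
    grad_A = outer(Bt_err, x)
    # The gradient passed to earlier layers goes through the frozen W as well.
    grad_x = [gw + ga for gw, ga in zip(matvec(transpose(W), err),
                                        matvec(transpose(A), Bt_err))]

    new_A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B, loss, grad_x


# Made-up example: 3 outputs, 4 inputs, rank 2.
W_codes = [[8, 12, 4, 10], [6, 8, 15, 3], [9, 1, 8, 13]]
scale = 0.5
A = [[0.5, -0.25, 0.0, 0.25], [0.0, 0.5, 0.25, -0.5]]
B = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # LoRA convention: B starts at zero
x = [1.0, 2.0, -1.0, 0.5]
target = [1.0, -1.0, 0.5]

A1, B1, loss_before, _ = lora_step(W_codes, scale, A, B, x, target, lr=0.5)
_, _, loss_after, _ = lora_step(W_codes, scale, A1, B1, x, target, lr=0.5)
print("loss before:", loss_before)
print("loss after one step:", loss_after)
```

Note that `W_codes` is read in every step but never written: the base contributes to the output and to `grad_x`, while the update touches only `A` and `B`.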

Why it matters

QLoRA made fine-tuning models with tens of billions of parameters possible on a single GPU, broadening who can adapt large models.

Key idea

QLoRA keeps the base model in 4-bit NF4 and trains only LoRA adapters, letting very large models be fine-tuned on a single GPU.
