Machine Learning

QLoRA

Combining 4-bit quantization with LoRA to fine-tune huge models on a single GPU.

Quantize then adapt

QLoRA combines two ideas: store the frozen base model in 4-bit quantized form, then train LoRA adapters on top. The compressed base takes roughly a quarter of the memory of 16-bit weights, and only the small adapters carry gradients and optimizer state, so even very large models can be fine-tuned on one GPU.
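The memory argument above can be checked with a little arithmetic. The following is a minimal sketch, not a measurement: the 7B parameter count, the 4096 x 4096 layer shape, and the rank of 16 are illustrative assumptions, and the estimate ignores activations, quantization constants, and optimizer state.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical 7B-parameter base model (assumption for illustration).
n_params = 7_000_000_000
fp16_gib = weight_memory_gib(n_params, 16)
int4_gib = weight_memory_gib(n_params, 4)
print(f"16-bit base: {fp16_gib:.1f} GiB, 4-bit base: {int4_gib:.1f} GiB")

# Hypothetical 4096 x 4096 linear layer with a rank-16 adapter.
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, rank=16)
print(f"adapter parameters as a share of the layer: {adapter / full:.2%}")
```

Under these assumptions the 4-bit base needs a quarter of the 16-bit storage, and the adapter adds fewer than 1% as many parameters as the layer it wraps.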

Key techniques

QLoRA introduced several techniques that cut memory use while keeping 4-bit fine-tuning accurate:

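  • NF4 (4-bit NormalFloat), a 4-bit data type whose levels are spaced to match the roughly normal (bell curve) distribution of neural network weights.
  • Double quantization, which also quantizes the per-block scaling constants produced by the first quantization, saving more memory.
  • Paged optimizers, which move optimizer state to host (CPU) memory during GPU memory spikes to avoid out-of-memory errors.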
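To make the first two techniques concrete, here is a simplified sketch of blockwise 4-bit quantization with a codebook built from normal quantiles, plus the per-weight cost of storing the scaling constants. It is an illustration under assumptions, not the actual NF4 implementation: the real NF4 table is constructed differently, and the function names, the block size of 64, the 32-bit and 8-bit constant widths, and the second-level block size of 256 are all choices made for this example.

```python
from statistics import NormalDist


def normal_quantile_codebook(levels=16):
    """Build `levels` values from evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].

    Simplified stand-in for NF4; the real table is built differently.
    """
    nd = NormalDist()
    # Offset by 0.5 so no probability is 0 or 1, where the quantile is infinite.
    qs = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]


def quantize_block(block, codebook):
    """Absmax-scale one block to [-1, 1], then keep the nearest codebook index per weight."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / scale))
        for w in block
    ]
    return codes, scale


def dequantize_block(codes, scale, codebook):
    """Recover approximate weights from 4-bit codes and the block's scaling constant."""
    return [codebook[c] * scale for c in codes]


def constant_overhead_bits(block=64, const_bits=32, double_quant=False, block2=256):
    """Extra bits per weight spent on scaling constants (all sizes are assumptions)."""
    if not double_quant:
        return const_bits / block  # one full-precision constant per block
    # 8-bit first-level constants, plus one full-precision constant per block2 of them.
    return 8 / block + const_bits / (block * block2)


codebook = normal_quantile_codebook()
weights = [0.02, -0.15, 0.07, 0.31, -0.04, 0.0, 0.11, -0.22]
codes, scale = quantize_block(weights, codebook)
restored = dequantize_block(codes, scale, codebook)
print("codes:", codes)
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))

# 32/64 = 0.5 bits per weight without double quantization,
# 8/64 + 32/(64*256), about 0.127 bits per weight, with it.
print(constant_overhead_bits(), constant_overhead_bits(double_quant=True))
```

Because the codebook levels cluster near zero, where most normally distributed weights fall, small weights are represented more finely than they would be with evenly spaced levels.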

How gradients flow

The frozen 4-bit weights are dequantized on the fly to a higher-precision compute type during the forward and backward passes. Gradients flow through them but update only the LoRA adapters, never the base. The heavy weights stay compact in memory while the small adapters learn.
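The flow described above can be sketched with a single linear layer in plain Python. This is a toy under stated assumptions: the uniform 4-bit levels in `dequantize` stand in for NF4, the squared-error loss, learning rate, and all numeric values are made up for illustration, and the gradients are derived by hand rather than by an autodiff library.

```python
def matvec(M, v):
    """Matrix-vector product M @ v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matTvec(M, v):
    """Transposed product M.T @ v."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]


def outer(u, v):
    """Outer product of two vectors."""
    return [[a * b for b in v] for a in u]


def dequantize(codes, scale):
    """Map 4-bit codes (0..15) to floats. Uniform levels are a stand-in for NF4."""
    return [[(c - 8) / 8 * scale for c in row] for row in codes]


def forward(codes, scale, A, B, x, scaling):
    """y = W x + scaling * B (A x), with W rebuilt on the fly and then discarded."""
    W = dequantize(codes, scale)
    h = matvec(A, x)
    y = [base + scaling * lora for base, lora in zip(matvec(W, x), matvec(B, h))]
    return y, h


# Frozen base: integer codes plus one scaling constant. Never updated below.
codes = [[3, 12, 8, 15], [0, 7, 9, 4], [10, 10, 2, 6]]
scale = 0.5
frozen_snapshot = [row[:] for row in codes]

# Trainable rank-2 adapter. B starts at zero so training begins at the base model's output.
A = [[0.1, -0.2, 0.05, 0.1], [-0.1, 0.1, 0.2, 0.1]]
B = [[0.0, 0.0] for _ in range(3)]
A_init = [row[:] for row in A]

x = [1.0, -0.5, 0.25, 2.0]
target = [1.0, 0.0, -1.0]
lr, scaling = 0.05, 1.0

losses = []
for step in range(20):
    y, h = forward(codes, scale, A, B, x, scaling)
    err = [yi - ti for yi, ti in zip(y, target)]  # dL/dy for L = 0.5 * ||y - target||^2
    losses.append(0.5 * sum(e * e for e in err))
    g = [scaling * e for e in err]        # gradient arriving at the adapter output
    grad_B = outer(g, h)                  # dL/dB
    grad_A = outer(matTvec(B, g), x)      # dL/dA, passed back through B
    # Only the adapter matrices are updated; no gradient is ever formed for `codes`.
    B = [[b - lr * gb for b, gb in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    A = [[a - lr * ga for a, ga in zip(ra, rg)] for ra, rg in zip(A, grad_A)]

print("first loss:", losses[0], "last loss:", losses[-1])
print("base codes unchanged:", codes == frozen_snapshot)
```

In a deeper network the dequantized weights would also carry gradients back to earlier layers, which is why they are needed in the backward pass even though they are never updated.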

Why it matters

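QLoRA made it practical to fine-tune models with tens of billions of parameters on a single GPU instead of a multi-GPU cluster (the original paper reports fine-tuning a 65B-parameter model on one 48 GB GPU), broadening who can adapt large models.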

Key idea

QLoRA keeps the base model in 4-bit NF4 and trains only LoRA adapters, letting huge models be fine-tuned on a single GPU.

Check yourself

1. In QLoRA, which parameters receive gradients?

2. What is NF4 designed for?

3. What does double quantization compress?