The Latent Diffusion

Run diffusion in a compressed latent space to slash compute while keeping quality.

Latent diffusion makes high-resolution image generation practical by moving the diffusion process out of pixel space and into a smaller, learned latent space.

The two-stage design

An autoencoder first compresses images into a compact latent representation and can decode them back.
A diffusion model then runs entirely in that latent space, adding and removing noise on small tensors instead of full images.
Finally, the autoencoder's decoder turns the generated latent back into a full-resolution image.
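
The flow above can be sketched in a few lines of NumPy. This is a toy illustration under loud assumptions, not a real model: `encode` and `decode` are hypothetical stand-ins (average pooling and nearest-neighbour upsampling) for a learned autoencoder, and the denoiser is replaced by an oracle that already knows the noise. The point is only to show where each stage operates and on what tensor shape.

```python
import numpy as np

def encode(image, factor=8):
    """Toy encoder (assumption): average-pool each factor x factor patch."""
    h, w, c = image.shape
    patches = image.reshape(h // factor, factor, w // factor, factor, c)
    return patches.mean(axis=(1, 3))

def decode(latent, factor=8):
    """Toy decoder (assumption): nearest-neighbour upsampling back to pixel size."""
    return latent.repeat(factor, axis=0).repeat(factor, axis=1)

def add_noise(latent, alpha_bar, rng):
    """Forward process in latent space: blend the clean latent with Gaussian noise."""
    noise = rng.standard_normal(latent.shape)
    noisy = np.sqrt(alpha_bar) * latent + np.sqrt(1.0 - alpha_bar) * noise
    return noisy, noise

def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert the blend above, given a prediction of the noise that was added."""
    return (noisy - np.sqrt(1.0 - alpha_bar) * predicted_noise) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))            # stand-in for a real image

latent = encode(image)                     # stage 1: compress to (8, 8, 3)
noisy, noise = add_noise(latent, alpha_bar=0.5, rng=rng)

# Stage 2: a trained network would predict the noise from `noisy`.
# Here the true noise is passed in as an oracle, purely for illustration.
recovered = remove_noise(noisy, noise, alpha_bar=0.5)

output = decode(recovered)                 # back to pixel space: (64, 64, 3)
print(latent.shape, output.shape)
```

Note that every noising and denoising operation touches only the small `(8, 8, 3)` tensor; the full-size array appears only at the very start and very end.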

Why it saves so much

The latent is far smaller than the pixel grid, so each diffusion step is much cheaper.
The autoencoder strips away imperceptible detail, letting the diffusion model focus on semantic structure.
This is what made it feasible to run powerful text-to-image models on a single consumer GPU.
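
To make the size difference concrete, here is a small back-of-the-envelope calculation. The shapes are assumptions chosen for illustration (a 512×512 RGB image, an 8× spatial downsampling factor, and 4 latent channels); actual values depend on the autoencoder used.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

# Illustrative, assumed shapes (not taken from the text above).
pixel_shape = (512, 512, 3)
downsample = 8
latent_channels = 4
latent_shape = (pixel_shape[0] // downsample,
                pixel_shape[1] // downsample,
                latent_channels)

value_ratio = num_values(*pixel_shape) / num_values(*latent_shape)
position_ratio = (pixel_shape[0] * pixel_shape[1]) / (latent_shape[0] * latent_shape[1])

print(latent_shape)     # (64, 64, 4)
print(value_ratio)      # 48.0
print(position_ratio)   # 64.0
```

Under these assumptions the denoiser processes 48 times fewer values and 64 times fewer spatial positions at every step, and that saving is repeated across all diffusion steps.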

Conditioning

Text or other conditions are injected through cross-attention layers inside the latent denoiser.
This is where prompts steer the generated latent before it is decoded.
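
A minimal sketch of a single cross-attention step may help here. It assumes the standard scaled dot-product form, with queries computed from the latent and keys and values computed from the prompt embeddings. All sizes, the random projection matrices, and the function names are hypothetical; a real denoiser learns these weights and uses multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Each latent position queries the prompt tokens and mixes in their values."""
    q = latent_tokens @ w_q                      # (n_latent, d)
    k = text_tokens @ w_k                        # (n_text, d)
    v = text_tokens @ w_v                        # (n_text, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (n_latent, n_text)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
latent_tokens = rng.standard_normal((64, 32))    # 8x8 latent flattened, 32 features each
text_tokens = rng.standard_normal((5, 16))       # 5 prompt tokens, 16-dim embeddings

# Random projections stand in for learned weights (assumption).
w_q = rng.standard_normal((32, 24))
w_k = rng.standard_normal((16, 24))
w_v = rng.standard_normal((16, 24))

attended, weights = cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```

Each row of `weights` says how strongly one latent position attends to each prompt token, which is the mechanism by which the prompt influences individual regions of the latent.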

Key idea

Latent diffusion compresses images with an autoencoder and runs the diffusion process in the resulting small latent space. This dramatically cuts compute while preserving quality, and it lets prompts condition generation through cross-attention.
