The Latent Diffusion
Latent diffusion makes high resolution image generation practical by moving the diffusion process out of pixel space and into a smaller learned latent space.
The two stage design
- An autoencoder first compresses images into a compact latent representation and can decode them back.
- A diffusion model then runs entirely in that latent space, adding and removing noise on small tensors instead of full images.
- A final decoder turns the generated latent back into a full resolution image.
Why it saves so much
- The latent is far smaller than the pixel grid, so each diffusion step is much cheaper.
- The autoencoder strips away imperceptible detail, letting the diffusion model focus on semantic structure.
- This is what made running powerful text to image models on a single consumer GPU feasible.
Conditioning
- Text or other conditions are injected through cross attention layers inside the latent denoiser.
- This is where prompts steer the generated latent before it is decoded.
Key idea
Latent diffusion compresses images with an autoencoder and runs the diffusion process in the small latent space, dramatically cutting compute while keeping high quality and enabling prompt conditioning through cross attention.