Two opposite processes
A diffusion model defines a forward process that gradually adds Gaussian noise to an image over many steps until it is nearly indistinguishable from pure noise. Training teaches the model the reverse process, which removes noise step by step; generation runs that reverse process to recover a clean image.
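The forward process described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the linear noise schedule, its endpoints (1e-4 to 0.02), the 1000-step count, and the names make_schedule and add_noise are all assumptions chosen for the example. It relies on the standard closed form for Gaussian noising, which lets you jump straight to any step t without looping.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed linear schedule: betas[t] is the noise variance added at step t.
    betas = np.linspace(beta_start, beta_end, num_steps)
    # alpha_bars[t] is the fraction of the original signal variance left at step t.
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    # Closed form: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))      # stand-in for a clean image
x_early, _ = add_noise(x0, 10, alpha_bars, rng)    # still close to x0
x_late, _ = add_noise(x0, 999, alpha_bars, rng)    # almost entirely noise
```

With this schedule, alpha_bars starts just below 1 and decays toward 0, so early steps barely change the image while the last step keeps almost none of it.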
What the network predicts
At each step a network, often a UNet, takes a noisy image and a timestep as input and predicts the noise that was added. Subtracting a portion of that predicted noise gives a slightly cleaner image. Repeating this, starting from pure noise, yields a sample.
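The sampling loop described above might look roughly like the following sketch. It assumes a DDPM-style update with a linear schedule and uses the step variance beta_t for the noise added back in, which is one common choice rather than the only one. The function toy_noise_predictor is a hypothetical stand-in for a trained network: it is exact only for a toy dataset in which every clean image is all zeros.

```python
import numpy as np

def toy_noise_predictor(xt, t, alpha_bars):
    # Hypothetical stand-in for a trained UNet. If every clean image is zero,
    # then x_t = sqrt(1 - alpha_bar_t) * noise, so the noise is recoverable exactly.
    return xt / np.sqrt(1.0 - alpha_bars[t])

def reverse_step(xt, t, betas, alpha_bars, predict_noise, rng):
    alpha_t = 1.0 - betas[t]
    eps_hat = predict_noise(xt, t, alpha_bars)
    # Subtract a scaled portion of the predicted noise, then rescale.
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # no fresh noise on the final step
    # Assumed variance choice: add back noise with variance beta_t.
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

def sample(shape, betas, predict_noise, rng):
    alpha_bars = np.cumprod(1.0 - betas)
    x = rng.standard_normal(shape)             # start from pure noise
    for t in range(len(betas) - 1, -1, -1):    # walk the steps in reverse
        x = reverse_step(x, t, betas, alpha_bars, predict_noise, rng)
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)          # assumed linear schedule
image = sample((8, 8), betas, toy_noise_predictor, rng)
```

Because the toy predictor is exact for all-zero data, the loop ends at an all-zero image up to floating-point error. A real model would replace toy_noise_predictor with a learned network.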
Why this is stable
Each step is a small, well-posed denoising task. Unlike adversarial training, the objective is a simple regression on the added noise, so training tends to be stable and the model tends to cover the modes of the data distribution, producing diverse outputs.
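To make the "simple regression" concrete, here is a sketch of the usual noise-prediction loss: noise a clean batch at random timesteps, ask the predictor for the noise, and take the mean squared error. The name denoising_loss, the batch layout (batch, features), the linear schedule, and both toy predictors are assumptions for illustration; a real setup would minimize this loss over a network's parameters.

```python
import numpy as np

def denoising_loss(predict_noise, x0_batch, alpha_bars, rng):
    batch = x0_batch.shape[0]
    # One random timestep per example.
    t = rng.integers(0, len(alpha_bars), size=batch)
    noise = rng.standard_normal(x0_batch.shape)
    signal_scale = np.sqrt(alpha_bars[t])[:, None]
    noise_scale = np.sqrt(1.0 - alpha_bars[t])[:, None]
    xt = signal_scale * x0_batch + noise_scale * noise
    # Plain mean squared error between predicted and true noise.
    return np.mean((predict_noise(xt, t) - noise) ** 2)

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0_batch = np.zeros((256, 16))   # toy dataset: every clean example is zero

def zero_predictor(xt, t):
    # Untrained baseline that always predicts "no noise".
    return np.zeros_like(xt)

def oracle_predictor(xt, t):
    # Exact for the all-zero toy dataset only.
    return xt / np.sqrt(1.0 - alpha_bars[t])[:, None]

zero_loss = denoising_loss(zero_predictor, x0_batch, alpha_bars, np.random.default_rng(0))
oracle_loss = denoising_loss(oracle_predictor, x0_batch, alpha_bars, np.random.default_rng(0))
```

The baseline's loss sits near 1, the variance of the unit Gaussian noise, while the oracle's loss is essentially zero. Training moves a network from the first regime toward the second with ordinary gradient descent, with no second network to balance against.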
Conditioning and latents
To steer generation, the network is conditioned on text or other signals, often through cross-attention. To save compute, latent diffusion runs the denoising process in a compressed latent space rather than on raw pixels, then decodes to an image once at the end.
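As a rough sketch of both ideas: in cross-attention, image features act as queries while the conditioning tokens supply keys and values, so each image location can pull in the parts of the prompt that matter to it. All dimensions and weight matrices below are random placeholders, not values from any real model. The final lines compare element counts for an assumed 512x512 RGB image and an assumed 64x64 latent with 4 channels, sizes chosen purely for illustration.

```python
import numpy as np

def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    q = image_tokens @ w_q    # queries come from the image features
    k = text_tokens @ w_k     # keys come from the conditioning tokens
    v = text_tokens @ w_v     # values come from the conditioning tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
n_img, n_txt, d_img, d_txt, d = 64, 7, 32, 48, 16   # placeholder sizes
image_tokens = rng.standard_normal((n_img, d_img))
text_tokens = rng.standard_normal((n_txt, d_txt))
w_q = rng.standard_normal((d_img, d)) / np.sqrt(d_img)
w_k = rng.standard_normal((d_txt, d)) / np.sqrt(d_txt)
w_v = rng.standard_normal((d_txt, d)) / np.sqrt(d_txt)
out, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)

# Assumed sizes: how many values each denoising step has to process.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
reduction = pixel_values / latent_values
```

Under these assumed sizes the latent holds 48 times fewer values than the pixel image, which is where the compute saving comes from: every denoising step works on the smaller tensor, and the decoder runs only once.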
The cost
The main drawback is slow sampling: each denoising step is a full pass through the network, and many steps are needed. Faster samplers and distillation reduce the step count to make generation practical.
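One family of faster samplers visits only a subset of the timesteps. The sketch below uses a deterministic DDIM-style update as an example of that idea; the text above does not name a specific sampler, so treat this choice, the 50-step budget, the linear schedule, and the all-zero toy predictor as assumptions.

```python
import numpy as np

def ddim_sample(shape, alpha_bars, predict_noise, num_steps, rng):
    # Evenly spaced subset of timesteps, from most noisy down to 0.
    timesteps = np.linspace(len(alpha_bars) - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        eps_hat = predict_noise(x, t)
        # Estimate the clean image implied by the current noise prediction.
        x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
        # Jump to the next kept timestep; after the last one, no noise remains.
        abar_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps_hat
    return x, timesteps

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule

def toy_noise_predictor(xt, t):
    # Hypothetical stand-in for a trained network; exact only for all-zero data.
    return xt / np.sqrt(1.0 - alpha_bars[t])

rng = np.random.default_rng(0)
image, timesteps = ddim_sample((8, 8), alpha_bars, toy_noise_predictor, 50, rng)
```

Here the network is called 50 times instead of 1000. With a real model, fewer steps usually trade some sample quality for speed, so the step count is a tuning knob rather than a free win.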
Key idea
Diffusion models learn to reverse a gradual noising process by repeatedly predicting and removing noise, giving stable training and diverse samples, with latent-space diffusion and faster samplers reducing the heavy sampling cost.