Diffusion Model
Starts with noise and finds the image inside — like a Patronus forming from darkness, but the spell is a neural network.
Diffusion models are the dominant approach to state-of-the-art image and video generation. The training process is simple in concept: take an image, add a small amount of random noise, then a bit more, and continue until the image is indistinguishable from pure noise. Then train a neural network to reverse this process: given a noisy version of an image and its noise level, predict what the less-noisy version should look like (in most implementations, by predicting the noise that was added). Repeated across billions of image-noise pairs at every noise level, the model learns the underlying structure of real images, essentially the "shape" of the space they occupy. At generation time, start with pure random noise and repeatedly apply the learned denoising step, with the text prompt guiding which "direction" in image space the denoising proceeds, until coherent content emerges.
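The noising and training steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular tool's implementation: the function names are made up for this entry, the "image" is just a short list of numbers, the linear noise schedule uses values commonly seen in DDPM-style code, and `predict_noise` is a hypothetical stand-in for the neural network.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal's variance kept at each noise level
    (assumed linear schedule; real systems may use other schedules)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to noise level t: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps

def training_loss(predict_noise, x0, t, alpha_bars, rng):
    """Mean squared error between the true noise and the model's guess."""
    xt, eps = add_noise(x0, t, alpha_bars, rng)
    eps_hat = predict_noise(xt, t)
    return sum((a - b) ** 2 for a, b in zip(eps, eps_hat)) / len(eps)

# Usage: an "untrained" model that always guesses zero noise.
rng = random.Random(0)
alpha_bars = make_alpha_bars()
image = [0.5, -0.25, 1.0, 0.0]  # toy stand-in for pixel values
loss = training_loss(lambda xt, t: [0.0] * len(xt), image, 500, alpha_bars, rng)
print(f"signal kept at the last step: {alpha_bars[-1]:.6f}, loss: {loss:.3f}")
```

Training amounts to nudging the network so that this loss shrinks across many images and many noise levels `t`. By the final noise level almost none of the original signal remains, which is why generation can start from pure noise.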
Latent diffusion models (the architecture behind Stable Diffusion, DALL-E 3, and most modern systems) run the diffusion process not directly on pixels but in a compressed "latent" representation produced by an encoder/decoder pair, typically a variational autoencoder. Because the diffusion process operates on a much lower-dimensional representation, computational requirements drop dramatically while output quality is maintained. The latent representation is decoded back to pixel space only for the final output. This efficiency is why latent diffusion models can run on consumer hardware; pixel-space diffusion at comparable quality would require far more compute. The same principle extends to video by adding a temporal dimension to the architecture and denoising across space and time together to produce coherent motion.
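To put a rough number on the savings, the arithmetic can be sketched as below. The defaults (8x spatial downsampling, 4 latent channels) are the figures used by Stable Diffusion 1.x; other systems use different values, and the function names are illustrative only.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given pixel size."""
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, rgb_channels=3, **kwargs):
    """How many times fewer values the denoiser handles in latent space."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (rgb_channels * height * width) / (c * h * w)

print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions, every denoising step touches 48 times fewer values than it would in pixel space, and that saving repeats at every step of generation.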
For B2B teams using or building AI video tools, diffusion models are the underlying technology in virtually every modern AI image and video generation tool: Midjourney, Stable Diffusion, Runway, Pika, Kling, and most others are diffusion-based or widely understood to be. Understanding diffusion provides intuition for why these tools work the way they do: why prompting requires describing what you want (the prompt guides the denoising direction), why generations with the same prompt but different random seeds produce different results (different starting noise), and why higher iteration counts (more denoising steps) generally produce higher quality, up to a point of diminishing returns, at the cost of generation time. This mental model helps practitioners prompt more effectively and set appropriate expectations for generation behavior.
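The seed and step-count behavior can be seen in a toy sampler. This is a sketch under heavy assumptions: the "dataset" contains only two possible values, -1 and +1, so the ideal denoiser has a closed form (`optimal_x0`) and stands in for a trained network; the update rule is a deterministic DDIM-style step; and all names are hypothetical. The toy has no prompt and is too simple to show quality improving with more steps. It only shows that the starting noise determines the result and that each step costs one denoiser call.

```python
import math
import random

def make_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Signal fraction kept at each noise level (assumed linear schedule)."""
    alpha_bars, running = [], 1.0
    for t in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_train_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def optimal_x0(x, ab):
    """Exact denoiser for toy data that is -1 or +1 with equal probability."""
    return math.tanh(math.sqrt(ab) * x / (1.0 - ab))

def denoise(start_noise, num_steps, alpha_bars):
    """Deterministic DDIM-style sampling; returns (sample, denoiser_calls)."""
    total = len(alpha_bars)
    timesteps = [i * total // num_steps for i in range(num_steps)][::-1]
    x, calls = start_noise, 0
    for i, t in enumerate(timesteps):
        ab = alpha_bars[t]
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x0_hat = optimal_x0(x, ab)  # one "network" evaluation per step
        calls += 1
        eps_hat = (x - math.sqrt(ab) * x0_hat) / math.sqrt(1.0 - ab)
        x = math.sqrt(ab_prev) * x0_hat + math.sqrt(1.0 - ab_prev) * eps_hat
    return x, calls

def generate(seed, num_steps, alpha_bars):
    """The seed fixes the starting noise, and therefore the whole result."""
    start_noise = random.Random(seed).gauss(0.0, 1.0)
    return denoise(start_noise, num_steps, alpha_bars)

alpha_bars = make_alpha_bars()
for seed in range(4):
    sample, calls = generate(seed, 20, alpha_bars)
    print(seed, round(sample, 3), calls)
```

Rerunning with the same seed reproduces the same sample exactly, while a different seed may land on the other value. Doubling `num_steps` doubles the number of denoiser calls, which in a real tool is where generation time goes.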
Related terms
- Latent Diffusion: The compressed-space generative process under most modern AI tools. Magic happening in the Room of Requirement: invisible, powerful, hard to explain.
- Stable Video Diffusion: The open-source video generation architecture, the Elvish forge where many modern AI video tools were first smelted.
- AI Video Generation: Video conjured from text and code. What the Hogwarts enchanted ceiling does, but for your product demo.
- Generative AI: AI that creates new content from scratch, like the enchanted quill that writes its own stories, no enrollment required.
- LoRA: Like the One Ring, it is small and lightweight but changes everything about how the model behaves once you put it on.