Understanding diffusion models: How Stable Diffusion creates images

The field of AI-generated imagery has undergone a revolution in recent years, with diffusion models emerging as the dominant technology powering tools like Stable Diffusion, DALL-E, and Midjourney. These models can create astonishingly realistic and creative images from simple text descriptions, transforming how digital art is created and consumed. But beneath the user-friendly interfaces lies a sophisticated mathematical framework that progressively transforms random noise into coherent images. This article explores the inner workings of diffusion models, with a particular focus on Stable Diffusion, examining the principles, architecture, and technical innovations that make this remarkable technology possible.

The foundation: What are diffusion models?

Diffusion models represent a class of generative models fundamentally different from their predecessors like Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). They are inspired by non-equilibrium thermodynamics—specifically, the process of gradually adding and then removing noise from data.

The core idea behind diffusion models involves two main processes:

1. Forward diffusion process

The forward process gradually adds random noise to an image until it becomes pure noise with no discernible structure. This process transforms a complex image distribution into a simple Gaussian (normal) distribution through a series of small steps.

Mathematically, at each timestep t, noise is added according to:

x_t = √(1-β_t) · x_{t-1} + √(β_t) · ε

Where:

  • x_t is the image at timestep t
  • β_t is a small noise schedule parameter
  • ε is random Gaussian noise

After many steps, the original image is completely transformed into random noise, effectively destroying all the structured information it contained.
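
To make the forward process concrete, here is a minimal NumPy sketch that applies the update above repeatedly; the linear beta schedule and the toy 64×64 array are illustrative assumptions, not the exact values used by any particular model.

import numpy as np

rng = np.random.default_rng(0)

def forward_diffusion(x0, betas):
    # Iteratively apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
    x = x0
    for beta_t in betas:
        eps = rng.standard_normal(x.shape)        # fresh Gaussian noise at every step
        x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * eps
    return x

x0 = rng.standard_normal((64, 64))                # stand-in for an image
betas = np.linspace(1e-4, 0.02, 1000)             # illustrative linear noise schedule
x_T = forward_diffusion(x0, betas)                # ends up statistically indistinguishable from pure noise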

2. Reverse diffusion process

The reverse process is where the magic happens. It starts with pure noise and gradually “denoises” it, step by step, to produce a coherent image. This is accomplished by training a neural network to predict the noise that was added at each step of the forward process, allowing it to progressively remove noise and recover structured data.

The key insight is that while the forward process destroys information in a simple, predefined way, the reverse process reconstructs information by learning complex patterns in the data distribution.
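
For reference, the standard DDPM reverse update (one common parameterization; other samplers use different rules) takes the form:

x_{t-1} = (1/√(α_t)) · (x_t − (β_t/√(1−ᾱ_t)) · ε_θ(x_t, t)) + σ_t · z

Where:

  • α_t = 1 − β_t and ᾱ_t is the product of all α values up to timestep t
  • ε_θ(x_t, t) is the neural network's prediction of the added noise
  • z is fresh Gaussian noise (omitted at the final step) and σ_t controls how much randomness is re-injected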

Stable Diffusion: A latent diffusion approach

Stable Diffusion, developed by researchers at LMU Munich (the CompVis group) and Runway with support from Stability AI, introduces several crucial innovations to the diffusion model framework:

Latent space diffusion

Unlike earlier diffusion models that operated directly in pixel space (which is computationally intensive), Stable Diffusion operates in a compressed latent space:

  1. Dimensionality reduction: Images are first encoded into a lower-dimensional latent representation using a variational autoencoder (VAE).
  2. Diffusion in latent space: The diffusion process occurs in this compressed space rather than at the pixel level.
  3. Decoding to image space: After the reverse diffusion process, the resulting latent representation is decoded back into a full-resolution image.

This approach drastically reduces computational requirements while maintaining image quality. The latent representation is downsampled by a factor of 4-8 in each spatial dimension (Stable Diffusion uses a factor of 8, so a 512×512 image becomes a 64×64 latent grid), making the diffusion process significantly more efficient.

Architecture components

Stable Diffusion consists of several key components working together:

1. VAE Encoder/Decoder

The VAE compresses images into a lower-dimensional latent space and later reconstructs them:

  • Encoder: Converts images (typically 512×512 pixels) into latent representations (e.g., 64×64×4 tensors)
  • Decoder: Converts the latent representations back into full-resolution images

The VAE is trained separately and remains fixed during the diffusion model training.
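
To make the shapes concrete, the following sketch uses the Hugging Face diffusers library (an illustrative tooling choice; the checkpoint identifier is just an example). The constant 0.18215 is the latent scaling factor used by the Stable Diffusion 1.x VAE.

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)                                # stand-in for an RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215     # -> (1, 4, 64, 64)
    recon = vae.decode(latents / 0.18215).sample                   # -> (1, 3, 512, 512)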

2. U-Net with attention mechanisms

The core component is a U-Net architecture, which predicts the noise added at each diffusion step:

  • Convolutional layers: Process spatial information in the latent representation
  • Cross-attention layers: Connect text embeddings with the image latent space
  • Self-attention layers: Allow different parts of the image to influence each other
  • Residual connections: Help maintain information flow through the network

The U-Net has a characteristic shape with a contracting path that reduces spatial dimensions while increasing feature channels, followed by an expansive path that does the opposite, with skip connections between corresponding layers.
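
A minimal PyTorch sketch of the cross-attention idea is shown below; the dimensions are illustrative, and the real U-Net interleaves such blocks with convolutions and self-attention at several resolutions.

import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    # Queries come from the image latents; keys and values come from the text embeddings.
    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, image_tokens, text_tokens):
        out, _ = self.attn(query=image_tokens, key=text_tokens, value=text_tokens)
        return image_tokens + out                 # residual connection

block = CrossAttentionBlock()
image_tokens = torch.randn(1, 64 * 64, 320)       # flattened latent positions
text_tokens = torch.randn(1, 77, 768)             # CLIP-style text embeddings
out = block(image_tokens, text_tokens)            # same shape as image_tokens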

3. Text encoder

Stable Diffusion uses a pre-trained text encoder (typically CLIP’s text encoder) to convert text prompts into embeddings that guide the image generation:

  • Text tokens: The prompt is tokenized and processed through a transformer encoder
  • Text embeddings: Dense vector representations capture semantic meaning
  • Cross-attention: These embeddings influence the denoising process through cross-attention mechanisms

This component is crucial for text-to-image capabilities, allowing the model to understand and visualize text descriptions.
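
For example, using the Hugging Face transformers library (an illustrative choice), the CLIP text encoder used by Stable Diffusion 1.x turns a prompt into a 77×768 sequence of embeddings:

from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a photo of an astronaut riding a horse",
                   padding="max_length", max_length=77, return_tensors="pt")
embeddings = text_encoder(tokens.input_ids).last_hidden_state      # shape (1, 77, 768)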

The technical process: From text to image

Generating an image with Stable Diffusion involves several distinct steps:

1. Text encoding

The text prompt is processed through the text encoder, producing embeddings that represent the semantic content the user wants to visualize. These embeddings guide the entire generation process.

2. Random noise initialization

A random noise tensor is sampled from a Gaussian distribution in the latent space. This acts as the starting point for the reverse diffusion process.

3. Iterative denoising

The U-Net progressively denoises the latent representation through multiple steps (typically 25-50):

Step algorithm:
1. Predict noise ε_θ at current timestep using the U-Net
2. Use this prediction to update the latent representation
3. Advance to the next timestep
4. Repeat until reaching the final timestep

During each step, the text embeddings influence the denoising through cross-attention, guiding the emerging image toward the desired content.
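
A bare-bones version of this loop, written against the diffusers API (a sketch assuming a Stable Diffusion 1.x checkpoint, with a random stand-in for the prompt embeddings and no guidance yet), looks like this:

import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

model_id = "runwayml/stable-diffusion-v1-5"                        # example checkpoint identifier
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

text_embeddings = torch.randn(1, 77, 768)                          # stand-in for the encoded prompt
scheduler.set_timesteps(30)                                        # number of denoising steps
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma   # random starting point

for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample    # one denoising update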

4. VAE decoding

After completing the denoising steps, the final latent representation is passed through the VAE decoder to produce the output image in pixel space.

Sampling strategies and optimizations

The basic sampling procedure described above can be enhanced with various strategies:

Classifier-free guidance

A crucial technique for controlling the influence of the text prompt:

predicted_noise = ε_uncond + guidance_scale · (ε_cond - ε_uncond)

Where:

  • ε_uncond is the noise prediction without text conditioning
  • ε_cond is the noise prediction with text conditioning
  • guidance_scale (typically 7-15) controls how closely the image follows the text

Higher guidance scales produce images that match the text more closely but may reduce diversity and increase artifacts.
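
In code, the guidance step itself is a one-liner applied inside the denoising loop; a minimal sketch:

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    # Push the prediction away from the unconditional estimate and toward the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

In practice both predictions usually come from a single batched U-Net call: the latent is duplicated, one copy paired with the empty-prompt embedding and the other with the prompt embedding.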

DDIM sampling

Denoising Diffusion Implicit Models (DDIM) sampling allows for faster generation with fewer steps while maintaining quality:

  • Uses a deterministic update rule instead of adding random noise at each step
  • Enables non-Markovian trajectories through the latent space
  • Allows for consistent image editing through controlled trajectories
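
Concretely, a single deterministic DDIM update (the η = 0 case, with ᾱ again denoting the cumulative product of 1 − β over timesteps) can be sketched as:

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    # First estimate the clean sample, then jump directly to the previous timestep.
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
    return alpha_bar_prev ** 0.5 * x0_pred + (1 - alpha_bar_prev) ** 0.5 * eps_pred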

Negative prompting

Users can specify not only what they want but what they don’t want to see:

  • A negative prompt is encoded alongside the positive prompt
  • Its embedding is used for the unconditional branch of classifier-free guidance, so each denoising step is pushed away from the features it describes
  • Commonly used to avoid specific artifacts or unwanted elements
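
A toy sketch of this with stand-in tensors, reusing the guidance formula from the previous subsection:

import torch

eps_neg = torch.randn(1, 4, 64, 64)    # stand-in: noise prediction conditioned on the negative prompt
eps_pos = torch.randn(1, 4, 64, 64)    # stand-in: noise prediction conditioned on the positive prompt
noise_pred = eps_neg + 7.5 * (eps_pos - eps_neg)    # guidance moves away from negative-prompt features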

Technical innovations in Stable Diffusion versions

Stable Diffusion has evolved through several versions, each introducing important technical improvements:

Stable Diffusion 1.x

The original implementation established the core architecture:

  • 512×512 resolution with latent space diffusion
  • CLIP ViT-L/14 text encoder
  • 860M parameter U-Net
  • 4-channel latent space at 64×64 resolution

Stable Diffusion 2.x

Version 2 introduced several improvements:

  • OpenCLIP (ViT-H/14) text encoder for better text understanding
  • Improved autoencoder for sharper details
  • Enhanced training data filtering
  • Higher-resolution 768×768 model variants trained with a v-prediction objective

Stable Diffusion XL (SDXL)

The current state-of-the-art version features:

  • Dual text encoders (OpenCLIP ViT-bigG/14 and CLIP ViT-L/14)
  • Significantly larger U-Net with roughly 2.6B parameters
  • An optional refiner model applied as a second stage for improved detail and composition
  • Additional conditioning on image size and cropping parameters for more coherent framing
  • Native 1024×1024 resolution output

Stable Diffusion 3 (upcoming)

Announced improvements include:

  • Multimodal conditioning beyond just text
  • Improved composition and anatomy
  • Enhanced text rendering capabilities
  • Better understanding of spatial relationships

Technical challenges and limitations

Despite their impressive capabilities, diffusion models like Stable Diffusion face several technical challenges:

Compositional understanding

The models often struggle with complex compositions:

  • Difficulty with counting objects or body parts
  • Challenges with spatial relationships between objects
  • Issues with consistent perspective and lighting

These problems stem from the lack of explicit 3D or scene graph representations in the model architecture.

Text rendering

Rendering readable text remains challenging:

  • Letters often appear distorted or nonsensical
  • Longer text passages become increasingly corrupted
  • Font consistency is difficult to maintain

This limitation relates to the model’s latent representation not preserving the fine details needed for legible text.

Computational efficiency

Despite the latent space optimization, diffusion models remain computationally intensive:

  • Multiple denoising steps require repeated forward passes
  • High-resolution images demand significant memory
  • Real-time applications remain challenging without specialized hardware

Prompt engineering complexity

Getting desired results often requires sophisticated prompt engineering:

  • Prompts must balance specificity with flexibility
  • The relationship between prompts and outputs isn’t always intuitive
  • Advanced techniques like negative prompting add complexity

Architectural insights: Why diffusion models work so well

The remarkable success of diffusion models like Stable Diffusion can be attributed to several key architectural advantages:

Stable training dynamics

Unlike GANs, which suffer from training instability and mode collapse, diffusion models have a well-defined and stable training objective—predicting the noise added at each step. This makes training more reliable and less prone to failure.

Incremental generation

The step-by-step denoising process allows the model to progressively refine its output, making decisions at multiple levels of detail. This contrasts with one-shot generation methods that must produce the entire image at once.

Information bottleneck

The latent space creates an information bottleneck that forces the model to learn efficient representations of images, capturing semantic structure rather than just pixel-level details.

Flexible conditioning

The cross-attention mechanism provides a flexible way to condition the generation process on various inputs (text, images, or other modalities), allowing for versatile applications.

Technical applications beyond basic generation

The architecture of Stable Diffusion enables various technical applications beyond simple text-to-image generation:

Image-to-image translation (img2img)

By initializing the latent space with an encoded existing image (partially noised) instead of pure noise, Stable Diffusion can transform images while maintaining their basic structure:

Algorithm:
1. Encode input image to latent space using VAE encoder
2. Add noise to reach a specific diffusion timestep (strength parameter)
3. Denoise from this point forward, guided by text prompt
4. Decode final latent to output image

This enables applications like style transfer, colorization, and content editing.
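
The key difference from text-to-image generation is step 2; with a diffusers-style scheduler it can be sketched as follows (the encoded image is replaced by a random stand-in and the checkpoint identifier is just an example):

import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
scheduler.set_timesteps(30)

strength = 0.6                                          # 0 = keep the input, 1 = ignore it entirely
init_latents = torch.randn(1, 4, 64, 64)                # stand-in for the VAE-encoded input image
start = int(len(scheduler.timesteps) * (1 - strength))  # skip the earliest, noisiest steps
t_start = scheduler.timesteps[start:start + 1]

noise = torch.randn_like(init_latents)
latents = scheduler.add_noise(init_latents, noise, t_start)
# ...then run the usual denoising loop from step index `start` onward, guided by the text prompt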

Inpainting and outpainting

By masking specific regions of an image, Stable Diffusion can selectively regenerate just those areas:

  • Inpainting: Regenerating selected parts of an image
  • Outpainting: Extending an image beyond its original boundaries

These techniques use the same underlying diffusion process, applied only to the masked regions while keeping the result consistent with the surrounding unmasked areas.
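
One common masked-diffusion approach (a sketch of how an inpainting loop can reuse a standard checkpoint, with the mask, latents, and scheduler assumed to be set up as in the earlier examples) blends the denoised latents with a re-noised copy of the original image at every step, so only the masked region actually changes:

# Inside the denoising loop (sketch): pin unmasked areas to the original image.
# mask == 1 where the image should be regenerated, 0 where it should be preserved.
noised_original = scheduler.add_noise(init_latents, torch.randn_like(init_latents), t.reshape(1))
latents = mask * latents + (1 - mask) * noised_original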

ControlNet and advanced conditioning

Extensions like ControlNet allow for conditioning the generation process on additional signals:

  • Edge maps for controlling structure
  • Depth maps for 3D consistency
  • Pose estimation for human figures
  • Segmentation maps for spatial control

ControlNet works by training a copy of the U-Net's encoder blocks on the additional conditioning signal and feeding its feature maps back into the frozen base U-Net through zero-initialized convolutions, alongside the usual text embeddings.

Personalization techniques

Various methods allow for personalizing Stable Diffusion to generate specific subjects or styles:

  • Textual Inversion: Learning new embeddings for specific concepts
  • DreamBooth: Fine-tuning the model to learn specific subjects
  • LoRA (Low-Rank Adaptation): Efficiently adapting the model with a small number of trainable parameters

These techniques enable generating consistent images of specific people, objects, or styles not present in the original training data.
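
Of these, LoRA is the easiest to sketch: rather than updating a weight matrix directly, a trainable low-rank product B·A is added on top of the frozen weights. The layer below is a generic illustration, not the exact layout of any particular implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen linear layer and adds a trainable low-rank update.
    def __init__(self, base: nn.Linear, rank=4, scale=1.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                   # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 320))
out = layer(torch.randn(1, 77, 768))                      # only A and B receive gradients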

The latent space: Where the magic happens

The latent space is central to understanding how Stable Diffusion works:

Semantic organization

The latent space is organized semantically, with similar visual concepts clustered together. This allows the model to navigate meaning, not just pixels:

  • Conceptually similar images have similar latent representations
  • Interpolating between latent vectors creates meaningful transitions
  • Operations in latent space correspond to semantic transformations
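
For instance, interpolating between two random latent tensors and denoising each intermediate point yields a smooth visual transition; spherical interpolation (slerp) is commonly preferred over straight linear blending for Gaussian latents. A small sketch:

import torch

def slerp(z1, z2, alpha):
    # Spherical interpolation between two latent tensors.
    a, b = z1.flatten(), z2.flatten()
    omega = torch.acos(torch.clamp(torch.dot(a / a.norm(), b / b.norm()), -1.0, 1.0))
    return (torch.sin((1 - alpha) * omega) * z1 + torch.sin(alpha * omega) * z2) / torch.sin(omega)

z_a, z_b = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
frames = [slerp(z_a, z_b, alpha) for alpha in torch.linspace(0, 1, 5)]   # denoise and decode each one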

Mathematical properties

The latent space has important mathematical properties:

  • Approximately Gaussian distribution (easier to sample from)
  • Lower dimensionality (4×64×64 vs. 3×512×512) enabling efficient computation
  • Preservation of structural information while discarding noise and fine details

Manipulating the latent space

Advanced applications directly manipulate the latent space:

  • Vector arithmetic (adding and subtracting concepts)
  • Semantic editing by modifying specific dimensions
  • Latent space projections for finding the closest encodable image to a given target

Training process and data considerations

The training of Stable Diffusion involves several important technical aspects:

Training data

Stable Diffusion is trained on massive datasets of image-text pairs:

  • LAION-5B dataset filtered down to aesthetically pleasing images (LAION-Aesthetics)
  • Hundreds of millions of diverse images with accompanying text descriptions
  • Content filtering to remove problematic material

The quality and diversity of this dataset significantly impact the model’s capabilities and biases.

Training procedure

The training process involves:

  1. Preprocessing: Encoding images into the latent space using the pre-trained VAE
  2. Forward diffusion: Adding noise according to a predefined schedule
  3. Model prediction: Training the U-Net to predict the added noise
  4. Loss calculation: Comparing predicted noise with actual noise added
  5. Optimization: Updating model weights to minimize this loss

This process is computationally intensive, typically requiring hundreds of high-end GPUs running for several weeks.
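
The heart of this loop is a simple noise-prediction objective. A schematic training step, assuming a diffusers-style U-Net call and pre-encoded latents and prompt embeddings, might look like this:

import torch
import torch.nn.functional as F

def training_step(unet, latents, text_embeddings, alphas_cumprod):
    # 1. Sample a random timestep and Gaussian noise for each example in the batch
    t = torch.randint(0, len(alphas_cumprod), (latents.shape[0],))
    noise = torch.randn_like(latents)

    # 2. Forward diffusion in closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy_latents = abar.sqrt() * latents + (1 - abar).sqrt() * noise

    # 3. Predict the noise and compare it with the noise that was actually added
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample
    return F.mse_loss(noise_pred, noise)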

Conditioning embedding

A critical aspect of training is how text conditions are embedded:

  • Text is processed through a frozen text encoder (CLIP or OpenCLIP)
  • Embeddings are projected to the appropriate dimensions for the U-Net
  • Cross-attention layers learn to associate text features with corresponding image features

Architectural comparison with other generative models

To understand Stable Diffusion’s unique approach, it’s useful to compare it with other generative architectures:

Versus GANs

  • Training stability: Diffusion models have more stable training than GANs’ adversarial approach
  • Mode coverage: Diffusion models capture the entire data distribution better, avoiding mode collapse
  • Quality-diversity tradeoff: GANs can produce sharper images but with less diversity
  • Conditioning flexibility: Diffusion models offer more flexible conditioning mechanisms

Versus VAEs

  • Latent structure: A pure VAE must force its latent distribution to match a simple prior and decode it in a single pass, which tends to blur fine detail; diffusion models only need to learn many small denoising steps starting from simple Gaussian noise
  • Sample quality: Diffusion models generally produce higher quality samples than standard VAEs
  • Hybrid approach: Stable Diffusion leverages VAEs for encoding/decoding while using diffusion for generation

Versus autoregressive models

  • Parallelism: Diffusion models can denoise all spatial locations simultaneously; autoregressive models generate sequentially
  • Context handling: Autoregressive models can have better long-range coherence for some tasks
  • Computational tradeoffs: Diffusion requires multiple steps but each step is highly parallelizable

The future of diffusion models

Several technical trends point to the future evolution of diffusion models:

Multimodal capabilities

Integrating multiple modalities beyond just text and images:

  • Text-to-video generation (already emerging)
  • 3D content generation from text
  • Audio-visual synchronized generation
  • Cross-modal translation between different forms of media

Architectural improvements

Ongoing research is enhancing the fundamental architecture:

  • Consistency models for faster sampling with fewer steps
  • Improved attention mechanisms for better compositional understanding
  • Hierarchical diffusion for better handling of multiple scales
  • Transformer-based U-Nets for improved context understanding

Hardware optimization

Specialized hardware and software are making diffusion models more efficient:

  • Quantization to reduce memory requirements
  • Specialized diffusion accelerators
  • Distilled models that require fewer denoising steps
  • Optimized implementations for mobile and edge devices

Conclusion

Stable Diffusion represents a remarkable achievement in generative AI, combining theoretical insights from probabilistic modeling with practical architectural decisions to create a system capable of transforming text descriptions into vivid images. Its latent diffusion approach has proven to be a watershed moment, democratizing access to high-quality image generation by reducing computational requirements while maintaining impressive capabilities.

As the technology continues to evolve, we can expect more sophisticated control mechanisms, better understanding of complex compositions, and expansion into additional modalities. The core principles of the diffusion process—gradually structuring information from randomness—provide a powerful and flexible framework that will likely remain central to generative AI for years to come.

Understanding the technical foundations of diffusion models not only helps explain their current capabilities and limitations but also points toward the future developments that will further transform digital media creation. As these models become more powerful, accessible, and integrated into creative workflows, they promise to fundamentally change how we think about visual content generation and the relationship between language and imagery.