Diffusion Model

Creator: Seonglae Cho
Created: 2022 Aug 24 14:49
Edited: 2025 Jun 20 16:38

Diffusion Probabilistic Model (DPM), Variational Diffusion Model

The high-level intuition is that the denoising model handles different frequency bands at different noise levels: the forward process destroys high-frequency content first, so the reverse process first generates low-frequency global structure and only later fills in high-frequency detail.
Whereas masked-prediction models are trained to predict masked portions of images and rely on structural tricks to limit how many pixels must be predicted at once to avoid quality degradation, diffusion models generate images by adding noise and then reversing it; the noise removes information evenly across the entire image, gradually reducing pixel correlations.
Gradually add Gaussian noise with a Markov Chain to model the increasing-noise process, then generate images by sampling from Gaussian noise: the decoder learns to reverse the noise back into images through a denoising process that models the noise distribution. Thanks to Explicit likelihood Modeling, this addresses the drawback of GAN, which covers less of the generation space.

Marginalization

p_\theta(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_0 \mid \mathbf{x}_{1:T})\, p_\theta(\mathbf{x}_{1:T})\, d\mathbf{x}_{1:T}

\text{ELBO}(q(x_{1:T} \mid x_0)) = \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log p_\theta(x_{0:T})\right] - \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log q(x_{1:T} \mid x_0)\right]

\implies L = -\,\text{KL}\left(q(x_T \mid x_0) \,\|\, p_\theta(x_T)\right) - \sum_{t>1} \text{KL}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right) + \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log p_\theta(x_0 \mid x_1)\right]

(posterior/prior divergence terms plus the expected log-likelihood)
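One intermediate step makes this decomposition explicit: factorize both Markov chains and rewrite the forward kernels with Bayes' rule, after which the q(x_t \mid x_0) ratios telescope into the q(x_T \mid x_0) term.
\text{ELBO} = \mathbb{E}_q\!\left[\log p_\theta(x_T) + \sum_{t=1}^{T} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right], \qquad q(x_t \mid x_{t-1}) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} \quad (t > 1)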

Forward process

q(x_{1:T} \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(\sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right)
which means (with \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s)
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
When we consider it as a form of Fourier Transform, we can interpret x_t as a Linear Combination of x_0 and Gaussian Noise \epsilon:
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
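As a concrete illustration, here is a minimal NumPy sketch of this closed-form noising step; the linear beta schedule and the helper name q_sample are assumptions for illustration, not taken from any specific implementation.

```python
import numpy as np

# Assumed linear beta schedule (DDPM-style: 1e-4 to 0.02 over 1000 steps)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{\alpha}_t = \prod_s \alpha_s

def q_sample(x0, t, rng=np.random.default_rng()):
    """Closed-form forward sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # eps is what the denoiser will later be trained to predict
```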

Reverse process

With the variational posterior q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(\tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I}), the reverse KL Divergence is used for mode seeking and to generate more expressive data. However, because the Gaussian noise is reduced to a set concentrated in a low-dimensional form and then reconstructed, the model may lack understanding of more complex dimensions.
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) is the neural network we are training.
L_{t-1} \propto \frac{1}{2\sigma_t^2} \left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2
When the distributions are Gaussian, the posterior has the closed form
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t, \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t
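A hedged sketch of how these posterior formulas are used for one reverse step: the model's noise estimate eps_hat is inverted to an estimate of x_0, then x_{t-1} is sampled from q(x_{t-1} \mid x_t, x_0); the function name and argument layout are assumptions.

```python
import numpy as np

def p_sample_step(xt, t, eps_hat, betas, alphas, alpha_bars, rng=np.random.default_rng()):
    """One reverse step x_t -> x_{t-1} using the closed-form posterior above."""
    # Invert x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps to estimate x_0
    x0_hat = (xt - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    # Posterior mean \tilde{mu}_t and variance \tilde{beta}_t from the equations above
    mu = (np.sqrt(abar_prev) * betas[t] / (1.0 - alpha_bars[t])) * x0_hat \
       + (np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - alpha_bars[t])) * xt
    beta_tilde = (1.0 - abar_prev) / (1.0 - alpha_bars[t]) * betas[t]
    noise = rng.standard_normal(xt.shape) if t > 0 else 0.0  # no noise at the final step
    return mu + np.sqrt(beta_tilde) * noise
```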

Reparameterization trick

We compute the KL divergence between the forward process and the reverse process to obtain the loss. To define it, the problem is reparameterized to predict the noise at step t rather than the image itself, which empirically improves performance. We also leverage the reparameterized form of the forward process's sampled x_t so that the variational posterior remains differentiable (a minimal training-step sketch follows the bullet points below).
L = \mathbb{E}_{t, x_0, \epsilon}\left[\lambda_t \left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right)\right\|^2\right], \quad \lambda_t = \frac{\beta_t^2}{2\sigma_t^2 (1 - \beta_t)(1 - \bar{\alpha}_t)}
  • \lambda_t is often fixed at 1 regardless of the step
  • \epsilon_\theta is a neural network predicting the noise, used instead of \mu_\theta
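The training-step sketch referenced above, in PyTorch-style pseudocode with \lambda_t fixed to 1; eps_model is a hypothetical noise-prediction network taking (x_t, t), and alpha_bars is a precomputed tensor of \bar{\alpha}_t values.

```python
import torch
import torch.nn.functional as F

def simple_diffusion_loss(eps_model, x0, alpha_bars):
    """L_simple: predict the injected noise at a uniformly random step t (lambda_t = 1)."""
    b = x0.shape[0]
    t = torch.randint(0, alpha_bars.shape[0], (b,), device=x0.device)  # one random step per sample
    eps = torch.randn_like(x0)
    abar = alpha_bars[t].view(b, *([1] * (x0.dim() - 1)))              # broadcast over image dims
    xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps                  # closed-form forward sample
    return F.mse_loss(eps_model(xt, t), eps)                           # hypothetical network call
```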

Network Architecture

For the encoder, a UNet-like CNN image-to-image model is used to predict the noise, but we can add an Attention Mechanism during the later compression/decompression stages, such as Cross-Attention between the conditioning and the image patches. Diffusion uses a Positional Embedding for each time step, which limits effective extrapolation beyond the trained range, just as transformers do not extrapolate well over positions.
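To make the per-step conditioning concrete, a small sketch of a transformer-style sinusoidal timestep embedding; the dimension and frequency base are assumed values, and in practice the embedding is passed through an MLP and added to each UNet block.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128, max_period: float = 10000.0):
    """Sinusoidal embedding of integer diffusion steps: shape [B] -> [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                      # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)    # [B, dim]
```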
Diffusion Model Notion

Diffusion Model Usages

Diffusion Models

Tutorial

smalldiffusion (yuanchenyang), updated 2025 Jun 19 18:10

Through noise prediction, we can mathematically show that the denoiser can be viewed as an "approximate projection" onto the data manifold, equivalent to the gradient of a smoothed distance function (Moreau envelope). Since the gradient of this smoothed distance to the manifold matches the denoiser output, the trained denoiser, as a metaphor, produces force vectors that gradually bend toward the data manifold.
Then, DDIM can be interpreted as gradient descent, combining momentum and DDPM techniques to improve convergence speed and image quality.
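Under that gradient-descent reading, here is a minimal sketch of one deterministic DDIM update (eta = 0); eps_hat is the model's noise prediction at step t, and the signature is an assumption rather than smalldiffusion's actual API.

```python
import numpy as np

def ddim_step(xt, t, t_prev, eps_hat, alpha_bars):
    """Deterministic DDIM update (eta = 0): re-noise the current x_0 estimate to level t_prev."""
    x0_hat = (xt - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    abar_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    return np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps_hat
```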
 
 

Recommendations