Diffusion Model

Creator
Seonglae Cho
Created
2022 Aug 24 14:49
Edited
2024 Dec 28 16:40

Diffusion Probabilistic model (DPM)

The high-level intuition is that the forward process destroys high-frequency content first and low-frequency content last, so the denoising model recovers low-frequency structure in the early (high-noise) reverse steps and high-frequency detail in the later (low-noise) steps.
For image generation, the model is trained to predict hidden (noised) parts of an image, and a structural trick minimizes the number of pixels the model must predict at once, preventing quality degradation. Diffusion models generate images by adding noise to an image and then reversing the process; because this removes information evenly across the whole image, correlations between pixels are reduced.
Gradually add Gaussian noise with a
Markov Chain
to model the increasing-noise process, then generate images by sampling from Gaussian noise: the decoder learns to reverse the noise back into images through a denoising process that models the noise distribution. Due to explicit likelihood modeling, this solves the drawback of
GAN
training, which covers less of the generation space.

Marginalization

p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}_0 \mid \mathbf{x}_{1:T})\, p_\theta(\mathbf{x}_{1:T})\, d\mathbf{x}_{1:T}
\text{ELBO}(q(x_{1:T} \mid x_0)) = \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log p_\theta(x_{0:T})\right] - \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log q(x_{1:T} \mid x_0)\right]
\implies L = -\text{KL}\left(q(x_T \mid x_0) \,\|\, p_\theta(x_T)\right) - \sum_{t>1} \text{KL}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right) + \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log p_\theta(x_0 \mid x_1)\right]
that is, posterior/prior divergence terms plus an expected log-likelihood term.

Forward process

q(x_{1:T} \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}) = \mathcal{N}\left(\sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)
which means, with \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^t \alpha_s,
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
Viewed loosely like a
Fourier Transform
decomposition, we can interpret x_t as a
Linear Combination
of x_0 and
Gaussian Noise
\epsilon.
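As an illustrative sketch (not from the note), the closed-form forward sampling can be written in NumPy; the linear β-schedule values below are assumptions in the spirit of DDPM:

```python
import numpy as np

# Assumed linear beta schedule (DDPM-style: 1e-4 to 2e-2 over 1000 steps)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form: sqrt(ab_t) x0 + sqrt(1-ab_t) eps."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

x0 = np.random.randn(32 * 32)            # a flattened toy "image"
eps = np.random.randn(32 * 32)
xT = q_sample(x0, T - 1, eps)
# At t = T-1, alpha_bar is nearly zero, so x_T is almost pure Gaussian noise.
```

Note that \bar{\alpha}_t decreases monotonically toward zero, which is what makes the terminal distribution approximately standard Gaussian.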

Reverse process

The variational posterior is q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(\tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\right).
The reverse
KL Divergence
is mode-seeking, which helps generate sharper, more expressive samples. However, because the Gaussian noise is reduced to a low-dimensional set and then reconstructed, the model's understanding of complex high-dimensional structure can be limited.
pθ(xt1xt)=N(μθ(xt,t),Σθ(xt,t))p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) is the neural network we are training.
Lt112σt2μ~t(xt,x0)μθ(xt,t)2 L_{t-1} \propto \frac{1}{2\sigma_t^2} \|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2 
When the posterior is Gaussian:
μ~t(xt,x0)=αˉt1βt1αˉtx0+αt(1αˉt1)1αˉtxtβ~t=1αˉt11αˉtβt\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t \\ \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t
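A minimal sketch of these posterior formulas, assuming an illustrative linear β-schedule (the schedule values are placeholders, not from the note):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)       # assumed schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean_variance(x0, xt, t):
    """Gaussian posterior q(x_{t-1} | x_t, x_0): returns (mu_tilde, beta_tilde)."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)           # weight on x_0
    coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)  # weight on x_t
    mu = coef_x0 * x0 + coef_xt * xt
    beta_tilde = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]
    return mu, beta_tilde
```

Since \bar{\alpha}_{t-1} > \bar{\alpha}_t, the posterior variance \tilde{\beta}_t is always smaller than \beta_t.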

Reparameterization trick

We compute the KL between the forward and reverse processes to obtain the loss. To define it, the problem is reparameterized to predict the noise at step t rather than the image itself, which has been empirically shown to improve performance. We also reparameterize the forward process's sampled x_t so that the variational posterior is differentiable.
L=Et,x0,ϵ[λtϵϵθ(αˉtx0+1αˉtϵ,t)2]λt=βt22σt2(1βt)(1αˉt)L = \mathbb{E}_{t, x_0, \epsilon} \left[ \lambda_t \left\| \epsilon - \epsilon_\theta \left( \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t \right) \right\|^2 \right]\\ \lambda_t = \frac{\beta_t^2}{2 \sigma_t^2 (1 - \beta_t)(1 - \bar{\alpha}_t)}
  • λt\lambda_t is often fixed at 1 regardless of the step
  • ϵθ\epsilon_\theta is a neural network instead of μθ\mu_\theta
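The simplified objective (λ_t fixed at 1) can be sketched as follows; the linear map W standing in for ε_θ and the schedule are purely hypothetical placeholders for a real network:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)       # assumed schedule
alpha_bars = np.cumprod(1.0 - betas)

D = 16                        # toy data dimension
W = np.zeros((D, D))          # stand-in "network": eps_theta(x_t, t) = W @ x_t

def simple_loss(x0):
    """One Monte-Carlo estimate of L_simple = E ||eps - eps_theta(x_t, t)||^2."""
    t = rng.integers(0, T)                  # uniform random timestep
    eps = rng.standard_normal(D)            # target noise
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    eps_pred = W @ xt                       # lambda_t fixed to 1 (simplified loss)
    return np.sum((eps - eps_pred) ** 2)

x0 = rng.standard_normal(D)
loss = simple_loss(x0)
```

In practice W would be a UNet taking (x_t, t), trained by gradient descent on this expectation over t, x_0, and ε.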

Network Architecture

For the denoising network, a
UNet
-like CNN image-to-image model predicts the noise, and an
Attention Mechanism
such as
Cross-Attention
over image patches can be added in the deeper compression/decompression stages. Diffusion uses a
Positional Embedding
for each time step, which, as in transformers, prevents effective extrapolation beyond the trained range of steps.
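A transformer-style sinusoidal time embedding, as commonly used for the diffusion timestep, might look like this sketch (the base 10000 follows the transformer convention; details are assumptions):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal positional embedding of a diffusion timestep (transformer-style)."""
    half = dim // 2
    # Geometric frequency ladder from 1 down to ~1/10000
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(250, 128)   # one 128-dim vector for timestep t=250
```

The embedding is typically passed through a small MLP and injected into each UNet block so the shared network can condition on the noise level.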
Diffusion Model Notion

Diffusion Model Usages

Diffusion Models