Variational AutoEncoder

Creator: Seonglae Cho
Created: 2023 May 24 8:39
Edited: 2025 Jul 3 9:49

VAEs

Sample from the mean and standard deviation to compute the latent sample.
VAEs are regularized autoencoders where the form of the regularizer is defined by the prior (
ELBO
). Simply put, it is a kind of regularizer that forces the approximate posterior to be expressed only with distributions of the kind used in variational inference, such as a normal distribution.

Marginalization

p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p_\theta(\mathbf{z}) \, d\mathbf{z}
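This integral is generally intractable to evaluate directly, which is what motivates the approximate posterior and the ELBO below. As a rough sketch (K is a sample count introduced only for illustration), a naive Monte Carlo estimate with prior samples shows the difficulty:
p_\theta(\mathbf{x}) = \mathbb{E}_{p_\theta(\mathbf{z})}\left[ p_\theta(\mathbf{x} \mid \mathbf{z}) \right] \approx \frac{1}{K} \sum_{k=1}^{K} p_\theta(\mathbf{x} \mid \mathbf{z}_k), \quad \mathbf{z}_k \sim p_\theta(\mathbf{z})
Most prior samples explain a given x poorly, so this estimator has very high variance; sampling z from q_\phi(z|x) instead is exactly what the ELBO formalizes.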

Reparameterization trick

Instead of sampling z from a distribution parameterized directly by the network, the encoder outputs a mean and standard deviation, and the randomness is pushed into a fixed noise variable:
q_\phi(z \mid x) = N(\mu, \sigma^2) \rightarrow z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim N(0, I)
Variational Inference
Similarly, when a Deterministic node appears before a Stochastic node in a
Computational Graph
, we use the reparameterization trick to move the stochastic node to a leaf position, enabling
Back Propagation
.
\mathcal{L}_{\text{ELBO}}(q_\phi(z|x)) = -\beta \, \text{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right) + \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right] \\ \text{Posterior/Prior divergence} + \text{Expected log likelihood}
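A minimal PyTorch sketch of this setup (the class name, layer sizes, and dimensions below are illustrative assumptions, not from the source): the encoder predicts \mu and \log \sigma^2, and the latent sample is built from fixed standard normal noise so gradients reach the encoder weights.

import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps x to the mean and log-variance of q_phi(z|x)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)  # predict log sigma^2 for numerical stability

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I): the stochastic node becomes a leaf,
    so backpropagation reaches mu and sigma (and hence the encoder weights)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps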
 

Mini batch

The mini-batch size B is typically chosen larger than 100, so a single latent sample per datapoint suffices.
\frac{1}{B} \sum_{i=1}^{B} \left( -\beta \, \text{KL}\left( q_\phi(z|x_i) \,\|\, p_\theta(z) \right) + \log p_\theta(x_i|z_i) \right)
With the reparameterization under a Gaussian posterior and standard normal prior, the KL term has a closed form (j indexes each dimension of the latent space):
\frac{1}{B} \sum_{i=1}^{B} \left( \frac{\beta}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right) + \log p_\theta(x_i | z_i) \right)
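A hedged PyTorch sketch of this mini-batch objective, assuming a Bernoulli decoder so the reconstruction term is a negative binary cross-entropy (the function name and reductions are illustrative):

import torch
import torch.nn.functional as F

def beta_elbo_loss(x, x_recon, mu, logvar, beta=1.0):
    """Negative beta-ELBO averaged over the mini-batch (to be minimized)."""
    batch = x.size(0)
    # Expected log-likelihood log p_theta(x|z): Bernoulli decoder -> negative BCE
    log_lik = -F.binary_cross_entropy(x_recon, x, reduction="sum") / batch
    # Closed-form KL(q_phi(z|x) || N(0, I)) = -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    return -(log_lik - beta * kl)  # minimize the negative ELBO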
 
 

Mutual information
H - D \le I(Z, X) \le R

H is the data entropy, R (the rate) is the KL term (the additional information needed to encode Z beyond the prior), and D (the distortion) is the reconstruction error.
\begin{aligned}
H &= -\mathbb{E}_{p(x)} \left[ \log p(x) \right] \\
D &= -\mathbb{E}_{p(x)\, q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] \\
R &= \mathbb{E}_{p(x)} \left[ \text{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right)\right]
\end{aligned}
Each point in the rate-distortion plane defines bounds on the mutual information.
https://arxiv.org/pdf/1711.00464
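A brief sketch of why these bounds hold, writing q_\phi(z) = \mathbb{E}_{p(x)}\left[ q_\phi(z|x) \right] for the aggregated posterior (a summary derivation, not quoted from the paper):
I(X;Z) = H(X) - H(X \mid Z) \ge H - D, \quad \text{since } D = \mathbb{E}\left[ -\log p_\theta(x \mid z) \right] \ge H(X \mid Z)
R = \mathbb{E}_{p(x)}\left[ \text{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right) \right] = I(X;Z) + \text{KL}\left( q_\phi(z) \,\|\, p_\theta(z) \right) \ge I(X;Z)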
  • Auto-Encoding limit (D = 0, R = H) - all structure is encoded in the latent variables.
    • The only information left to encode is the data's randomness, so the rate equals the entropy (R = H).
    • A sufficiently powerful decoder can then reconstruct x perfectly from the latent (D = 0).
  • Auto-Decoding limit (D = H, R = 0)
    • The posterior over latent variables is ignored: p(z|x) = p(z) (R = 0).
    • Z contains none of X's structure, and the decoder reconstructs X independently of Z (D = H).
Neither limit is desirable, so the ideal operating point lies somewhere between the two.

Training Dynamics

Check that the KL term does not collapse and converges stably, and compare train versus validation reconstruction loss to identify under- or overfitting.
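A minimal monitoring sketch along these lines (the function, thresholds, and messages are hypothetical heuristics, not from the source):

def check_training_dynamics(train_recon, val_recon, kl, kl_floor=0.1, gap_ratio=1.2):
    """Inspect per-epoch averages of the loss terms.
    - KL near zero suggests posterior collapse (the decoder ignores z)
    - validation reconstruction much worse than train suggests overfitting
    - flat, high reconstruction loss on both suggests underfitting"""
    if kl[-1] < kl_floor:
        print(f"KL ~ {kl[-1]:.3f}: possible posterior collapse (consider KL annealing or a lower beta)")
    if val_recon[-1] > gap_ratio * train_recon[-1]:
        print("Validation reconstruction diverging from train: possible overfitting")
    if len(train_recon) > 1 and train_recon[0] - train_recon[-1] < 1e-3:
        print("Reconstruction loss barely improving: possible underfitting")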
 
 
 
 

Recommendations