Variational AutoEncoder

Creator: Seonglae Cho
Created: 2023 May 24 8:39
Edited: 2025 Jul 3 9:49

VAEs

Sample from the mean and standard deviation to compute the latent sample.
VAEs are regularized autoencoders where the form of the regularizer is defined by the prior (
ELBO
). Simply put, it is a kind of regularizer that forces the approximate posterior to be expressed only with distributions of the kind used in variational inference, such as a normal distribution.

Marginalization

p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p_\theta(\mathbf{z}) \, d\mathbf{z}
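This integral is generally intractable to evaluate directly, which is what motivates the approximate posterior and the ELBO below. As a rough sketch (K is a sample count introduced only for illustration), a naive Monte Carlo estimate with prior samples shows the difficulty:
p_\theta(\mathbf{x}) = \mathbb{E}_{p_\theta(\mathbf{z})}\left[ p_\theta(\mathbf{x} \mid \mathbf{z}) \right] \approx \frac{1}{K} \sum_{k=1}^{K} p_\theta(\mathbf{x} \mid \mathbf{z}_k), \quad \mathbf{z}_k \sim p_\theta(\mathbf{z})
Most prior samples explain a given x poorly, so this estimator has very high variance; sampling z from q_\phi(z|x) instead is exactly what the ELBO formalizes.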

Reparameterization trick

Instead of sampling z from a distribution parameterized directly by the network, the encoder outputs a mean and standard deviation, and the randomness is pushed into a fixed noise variable:
q_\phi(z \mid x) = N(\mu, \sigma^2) \rightarrow z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim N(0, I)
Variational Inference
Similarly, when a Deterministic node appears before a Stochastic node in a
Computational Graph
, we use the reparameterization trick to move the stochastic node to a leaf position, enabling
Back Propagation
.
\mathcal{L}_{\text{ELBO}}(q_\phi(z|x)) = -\beta \, \text{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right) + \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right] \\ \text{Posterior/Prior divergence} + \text{Expected log likelihood}
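A minimal PyTorch sketch of this setup (the class name, layer sizes, and dimensions below are illustrative assumptions, not from the source): the encoder predicts \mu and \log \sigma^2, and the latent sample is built from fixed standard normal noise so gradients reach the encoder weights.

import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps x to the mean and log-variance of q_phi(z|x)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)  # predict log sigma^2 for numerical stability

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I): the stochastic node becomes a leaf,
    so backpropagation reaches mu and sigma (and hence the encoder weights)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps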
 

Mini batch

The mini-batch size B is typically chosen larger than 100, so a single latent sample per datapoint suffices.
\frac{1}{B} \sum_{i=1}^{B} \left( -\beta \, \text{KL}\left( q_\phi(z|x_i) \,\|\, p_\theta(z) \right) + \log p_\theta(x_i|z_i) \right)
With the reparameterization under a Gaussian posterior and standard normal prior, the KL term has a closed form (j indexes each dimension of the latent space):
\frac{1}{B} \sum_{i=1}^{B} \left( \frac{\beta}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right) + \log p_\theta(x_i | z_i) \right)
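A hedged PyTorch sketch of this mini-batch objective, assuming a Bernoulli decoder so the reconstruction term is a negative binary cross-entropy (the function name and reductions are illustrative):

import torch
import torch.nn.functional as F

def beta_elbo_loss(x, x_recon, mu, logvar, beta=1.0):
    """Negative beta-ELBO averaged over the mini-batch (to be minimized)."""
    batch = x.size(0)
    # Expected log-likelihood log p_theta(x|z): Bernoulli decoder -> negative BCE
    log_lik = -F.binary_cross_entropy(x_recon, x, reduction="sum") / batch
    # Closed-form KL(q_phi(z|x) || N(0, I)) = -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    return -(log_lik - beta * kl)  # minimize the negative ELBO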
 
 

Mutual information
H - D \le I(Z, X) \le R

H is the data entropy, R (the rate) is the KL term (the additional information needed to encode Z beyond the prior), and D (the distortion) is the reconstruction error.
\begin{aligned}
H &= -\mathbb{E}_{p(x)} \left[ \log p(x) \right] \\
D &= -\mathbb{E}_{p(x)\, q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] \\
R &= \mathbb{E}_{p(x)} \left[ \text{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right)\right]
\end{aligned}
Each point in the rate-distortion plane defines bounds on the mutual information.
https://arxiv.org/pdf/1711.00464
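A brief sketch of why these bounds hold, writing q_\phi(z) = \mathbb{E}_{p(x)}\left[ q_\phi(z|x) \right] for the aggregated posterior (a summary derivation, not quoted from the paper):
I(X;Z) = H(X) - H(X \mid Z) \ge H - D, \quad \text{since } D = \mathbb{E}\left[ -\log p_\theta(x \mid z) \right] \ge H(X \mid Z)
R = \mathbb{E}_{p(x)}\left[ \text{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right) \right] = I(X;Z) + \text{KL}\left( q_\phi(z) \,\|\, p_\theta(z) \right) \ge I(X;Z)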
  • Auto-Encoding limit (D = 0, R = H) - all structure is encoded in the latent variables.
    • The only information left to encode is the data's randomness, so the rate equals the entropy (R = H).
    • A sufficiently powerful decoder can then reconstruct x perfectly from the latent (D = 0).
  • Auto-Decoding limit (D = H, R = 0)
    • The posterior over latent variables is ignored: p(z|x) = p(z) (R = 0).
    • Z contains none of X's structure, and the decoder reconstructs X independently of Z (D = H).
Neither limit is desirable, so the ideal operating point lies somewhere between the two.

Training Dynamics

Check that the KL term does not collapse and converges stably, and compare train versus validation reconstruction loss to identify under- or overfitting.
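A minimal monitoring sketch along these lines (the function, thresholds, and messages are hypothetical heuristics, not from the source):

def check_training_dynamics(train_recon, val_recon, kl, kl_floor=0.1, gap_ratio=1.2):
    """Inspect per-epoch averages of the loss terms.
    - KL near zero suggests posterior collapse (the decoder ignores z)
    - validation reconstruction much worse than train suggests overfitting
    - flat, high reconstruction loss on both suggests underfitting"""
    if kl[-1] < kl_floor:
        print(f"KL ~ {kl[-1]:.3f}: possible posterior collapse (consider KL annealing or a lower beta)")
    if val_recon[-1] > gap_ratio * train_recon[-1]:
        print("Validation reconstruction diverging from train: possible overfitting")
    if len(train_recon) > 1 and train_recon[0] - train_recon[-1] < 1e-3:
        print("Reconstruction loss barely improving: possible underfitting")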
 
 
 
 

Recommendations