Importance sampling

Creator: Seonglae Cho
Created: 2024 Apr 4 15:36
Edited: 2025 Apr 29 1:32
Not a mathematical proposition, just a practical approximation

The stochasticity trick of the Monte Carlo method (
Importance sampling
)

Assume you have a difficult integral to compute:
$$\int f(x)\,dx = \int \frac{f(x)}{g(x)} g(x)\,dx = \mathbb{E}_g\left[\frac{f(X)}{g(X)}\right] \approx \frac{1}{m}\sum_{i=1}^m \frac{f(X_i)}{g(X_i)}, \quad X_i \sim g$$

$$\mathbb{E}_g[w(X)] \approx \frac{1}{m}\sum_{i=1}^m w(X_i), \quad w(X) = \frac{f(X)}{g(X)}$$
The importance sampling estimator for $E[f(X)]$ performs better than sampling from the original distribution when it has lower variance. For comparison, the variance of $I=\frac{1}{N}\sum f(x_i)$ is $\frac{1}{N}V(f(X))$, while the variance of $\frac{1}{N}\sum_g f(x_i)\frac{p(x_i)}{g(x_i)}$ is $\frac{1}{N}V\left(f(X)\frac{p(X)}{g(X)}\right)$, with lower variance being preferable.
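To make this concrete, here is a minimal NumPy/SciPy sketch of the plain importance sampling estimator above, comparing its sample variance against direct Monte Carlo; the target $p$, integrand $f$, and proposal $g$ are illustrative assumptions, not from the source.

```python
# A minimal sketch of plain importance sampling vs. direct Monte Carlo.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m = 100_000

p = stats.norm(0, 1)        # distribution we want expectations under
f = lambda x: x ** 2        # E_p[f(X)] = 1 exactly
g = stats.norm(0, 2)        # heavier-tailed proposal keeps weights bounded

# Plain Monte Carlo: sample p directly.
xp = p.rvs(size=m, random_state=rng)
mc_samples = f(xp)

# Importance sampling: sample g, weight each value by w(x) = p(x) / g(x).
xg = g.rvs(size=m, random_state=rng)
is_samples = f(xg) * p.pdf(xg) / g.pdf(xg)

# Both means are ≈ 1; the per-sample variances decide which estimator is better.
print(mc_samples.mean(), mc_samples.var())
print(is_samples.mean(), is_samples.var())
```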
In short, Monte Carlo methods enable us to estimate any integral by random sampling. In
Bayesian Statistics
,
Evidence
is also a form of integral, so it becomes tractable.

Self-normalized importance sampling

The weight is the probability of the target $p(x,y)$ divided by the proposal distribution $g$
Small setback: for the particular case where the function $f$ is integrated against the posterior $p(x|y)$, we can only evaluate the weight $w(x)$ up to a constant.
So we define the unnormalized weights $w(x) = \frac{p(x,y)}{g(x)}$, which can be used to approximate the
Evidence
.
$$p(y) = \int p(x,y)\,dx = \int g(x) \frac{p(x,y)}{g(x)}\,dx = \mathbb{E}_g[w(X)]$$
Use the unnormalized importance weights, with the same values $x$ sampled from $g$, to estimate both the numerator and the denominator.
$$\mathbb{E}_{p(X|Y)}[f(X)] = \int p(x|Y)f(x)\,dx \\
= \int \frac{p(x, Y)}{p(Y)} f(x)\,dx \\
= \frac{1}{p(Y)} \int p(x, Y)f(x)\,dx \\
= \frac{1}{p(Y)} \int \frac{g(x)}{g(x)} p(x, Y)f(x)\,dx \\
= \frac{\mathbb{E}_{g}[w(X)f(X)]}{\mathbb{E}_{g}[w(X)]} \approx \frac{\frac{1}{m}\sum_{i=1}^m w(X_i)f(X_i)}{\frac{1}{m}\sum_{i=1}^m w(X_i)}$$
Then we normalize the unnormalized importance weights $w$ into $W$:
$$W_i = \frac{w(X_i)}{\sum_{j=1}^m w(X_j)}, \quad \mathbb{E}_{p(X|Y)}[f(X)] \approx \sum_{i=1}^m W_i f(X_i)$$
After that, we can express the
Posterior
with
Delta function
for point masses
$$p(x|y) = \sum_{i=1}^m W_i \delta_{x_i}(x)$$
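A minimal sketch of the whole self-normalized pipeline, assuming a toy conjugate model $x \sim N(0,1)$, $y \mid x \sim N(x,1)$ with an observed $y$ (an illustrative choice, not from the source); it estimates the evidence from $\mathbb{E}_g[w(X)]$ and the posterior mean from the normalized weights $W_i$.

```python
# Self-normalized importance sampling for a toy posterior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, y = 100_000, 1.0

prior = stats.norm(0, 1)               # p(x)
g = stats.norm(0, 2)                   # proposal g(x), broader than the posterior

x = g.rvs(size=m, random_state=rng)

# Unnormalized weights w(x) = p(x, y) / g(x) = p(y | x) p(x) / g(x).
w = stats.norm(x, 1).pdf(y) * prior.pdf(x) / g.pdf(x)

evidence = w.mean()                    # p(y) = E_g[w(X)]
W = w / w.sum()                        # self-normalized weights W_i
post_mean = np.sum(W * x)              # E[X | y] ≈ sum_i W_i x_i

# The exact posterior here is N(y/2, 1/2), so post_mean ≈ 0.5.
print(evidence, post_mean)
```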

Likelihood Weighting

Using the prior $p(x)$ as the proposal distribution $g(x)$, we can weight samples by their likelihood. This provides a softer approach compared to
Rejection Sampling
, which uses hard rejection/acceptance when sampling latent variables.
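Likelihood weighting falls out as the special case $g(x) = p(x)$, so the unnormalized weight reduces to the likelihood $p(y|x)$. A sketch under the same toy model as above (still an illustrative assumption):

```python
# Likelihood weighting: sample from the prior, weight by the likelihood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, y = 100_000, 1.0

x = stats.norm(0, 1).rvs(size=m, random_state=rng)   # samples from the prior p(x)
w = stats.norm(x, 1).pdf(y)                          # weight = likelihood p(y | x)

W = w / w.sum()
print(np.sum(W * x))   # posterior mean ≈ 0.5; no samples are rejected
```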
In
Importance sampling
practice, a good proposal must be close to the posterior, which might be quite different from the prior.
MCMC
algorithms draw samples from a target distribution by performing a biased
Random Walk
over the space of latent variables $x$.

Annealed Importance Sampling (AIS)

Instead of using a fixed proposal distribution, samples are drawn through a sequence of gradually changing distributions, transitioning to the target distribution through intermediate distributions.
MCMC
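A minimal AIS sketch, assuming a geometric path $\log p_\beta = (1-\beta)\log p_0 + \beta \log p_T$ between an easy start distribution and an unnormalized target, with one Metropolis move per intermediate distribution; all concrete choices here are illustrative assumptions.

```python
# Annealed Importance Sampling along a geometric path of distributions.
import numpy as np

rng = np.random.default_rng(0)

def log_p0(x):   # start distribution N(0, 1), easy to sample exactly
    return -0.5 * x ** 2

def log_pT(x):   # unnormalized target, here N(3, 1) without its constant
    return -0.5 * (x - 3.0) ** 2

def log_pk(x, b):  # intermediate distribution: (1-b)*log p0 + b*log pT
    return (1 - b) * log_p0(x) + b * log_pT(x)

betas = np.linspace(0.0, 1.0, 50)    # annealing schedule beta_0=0 ... beta_K=1
n = 5000
x = rng.normal(0.0, 1.0, size=n)     # exact samples from p0
logw = np.zeros(n)                   # accumulated log importance weights

for b_prev, b in zip(betas[:-1], betas[1:]):
    # Weight update: ratio of consecutive intermediate densities at current x.
    logw += log_pk(x, b) - log_pk(x, b_prev)
    # One Metropolis move to keep x approximately distributed as p_b.
    prop = x + rng.normal(0.0, 0.5, size=n)
    accept = np.log(rng.uniform(size=n)) < log_pk(prop, b) - log_pk(x, b)
    x = np.where(accept, prop, x)

# exp(logw) estimates Z_T / Z_0; both normalizers are sqrt(2*pi) here, so ≈ 1.
print(np.exp(logw).mean())
```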
Weight the data according to the probability of trajectory occurrence, which allows mathematically sound use of
Off-policy
data to utilize more data while weighting each sample by its importance.
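A bandit-style sketch of this off-policy reweighting, where samples collected under a behavior policy $b$ are weighted by $\pi(a)/b(a)$ to evaluate a target policy $\pi$; all concrete numbers are assumptions.

```python
# Off-policy evaluation via importance weights (two-action bandit toy).
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.8, 0.2])      # target policy pi(a) we want to evaluate
b = np.array([0.5, 0.5])       # behavior policy that generated the data
r_mean = np.array([1.0, 0.0])  # expected reward per action

# Data collected under the behavior policy.
a = rng.choice(2, size=100_000, p=b)
r = r_mean[a] + rng.normal(0, 0.1, size=a.size)

# Weight each sample by pi(a)/b(a) to estimate E_pi[r] from behavior data.
w = pi[a] / b[a]
print((w * r).mean())  # ≈ pi @ r_mean = 0.8
```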