Importance sampling

Creator: Seonglae Cho
Created: 2024 Apr 4 15:36
Edited: 2025 Apr 29 1:32
Not a mathematical proposition, just a practical approximation

The stochasticity trick of the Monte Carlo method (
Importance sampling
)

Assume you have a difficult integral to compute:
$$\int f(x)\,dx = \int \frac{f(x)}{g(x)} g(x)\,dx = \mathbb{E}_g\left[\frac{f(X)}{g(X)}\right] \approx \frac{1}{m}\sum_{i=1}^m \frac{f(X_i)}{g(X_i)}, \quad X_i \sim g$$

$$\mathbb{E}_g[w(X)] \approx \frac{1}{m}\sum_{i=1}^m w(X_i), \quad w(X) = \frac{f(X)}{g(X)}$$
The importance sampling estimator for $E[f(X)]$ performs better than sampling from the original distribution when it has lower variance. For comparison, the variance of $I=\frac{1}{N}\sum f(x_i)$ is $\frac{1}{N}V(f(X))$, while the variance of $\frac{1}{N}\sum_g f(x_i)\frac{p(x_i)}{g(x_i)}$ is $\frac{1}{N}V\left(f(X)\frac{p(X)}{g(X)}\right)$, with lower variance being preferable.
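To make this concrete, here is a minimal NumPy/SciPy sketch of the plain importance sampling estimator above, comparing its sample variance against direct Monte Carlo; the target $p$, integrand $f$, and proposal $g$ are illustrative assumptions, not from the source.

```python
# A minimal sketch of plain importance sampling vs. direct Monte Carlo.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m = 100_000

p = stats.norm(0, 1)        # distribution we want expectations under
f = lambda x: x ** 2        # E_p[f(X)] = 1 exactly
g = stats.norm(0, 2)        # heavier-tailed proposal keeps weights bounded

# Plain Monte Carlo: sample p directly.
xp = p.rvs(size=m, random_state=rng)
mc_samples = f(xp)

# Importance sampling: sample g, weight each value by w(x) = p(x) / g(x).
xg = g.rvs(size=m, random_state=rng)
is_samples = f(xg) * p.pdf(xg) / g.pdf(xg)

# Both means are ≈ 1; the per-sample variances decide which estimator is better.
print(mc_samples.mean(), mc_samples.var())
print(is_samples.mean(), is_samples.var())
```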
In short, Monte Carlo methods enable us to estimate any integral by random sampling. In
Bayesian Statistics
,
Evidence
is also a form of integral, so it becomes tractable.

Self-normalized importance sampling

The weight is the probability of the target $p(x,y)$ divided by the proposal distribution $g$
Small setback: for the particular case where the function $f$ is integrated against the posterior $p(x|y)$, we can only evaluate the weight $w(x)$ up to a constant.
So we define the unnormalized weights $w(x) = \frac{p(x,y)}{g(x)}$, which can be used to approximate the
Evidence
.
$$p(y) = \int p(x,y)\,dx = \int g(x) \frac{p(x,y)}{g(x)}\,dx = \mathbb{E}_g[w(X)]$$
Use the unnormalized importance weights, with the same values $x$ sampled from $g$, to estimate both the numerator and the denominator.
$$\mathbb{E}_{p(X|Y)}[f(X)] = \int p(x|Y)f(x)\,dx \\
= \int \frac{p(x, Y)}{p(Y)} f(x)\,dx \\
= \frac{1}{p(Y)} \int p(x, Y)f(x)\,dx \\
= \frac{1}{p(Y)} \int \frac{g(x)}{g(x)} p(x, Y)f(x)\,dx \\
= \frac{\mathbb{E}_{g}[w(X)f(X)]}{\mathbb{E}_{g}[w(X)]} \approx \frac{\frac{1}{m}\sum_{i=1}^m w(X_i)f(X_i)}{\frac{1}{m}\sum_{i=1}^m w(X_i)}$$
Then we normalize the unnormalized importance weights $w$ into $W$:
$$W_i = \frac{w(X_i)}{\sum_{j=1}^m w(X_j)}, \quad \mathbb{E}_{p(X|Y)}[f(X)] \approx \sum_{i=1}^m W_i f(X_i)$$
After that, we can express the
Posterior
with
Delta function
for point masses
$$p(x|y) = \sum_{i=1}^m W_i \delta_{x_i}(x)$$
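A minimal sketch of the whole self-normalized pipeline, assuming a toy conjugate model $x \sim N(0,1)$, $y \mid x \sim N(x,1)$ with an observed $y$ (an illustrative choice, not from the source); it estimates the evidence from $\mathbb{E}_g[w(X)]$ and the posterior mean from the normalized weights $W_i$.

```python
# Self-normalized importance sampling for a toy posterior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, y = 100_000, 1.0

prior = stats.norm(0, 1)               # p(x)
g = stats.norm(0, 2)                   # proposal g(x), broader than the posterior

x = g.rvs(size=m, random_state=rng)

# Unnormalized weights w(x) = p(x, y) / g(x) = p(y | x) p(x) / g(x).
w = stats.norm(x, 1).pdf(y) * prior.pdf(x) / g.pdf(x)

evidence = w.mean()                    # p(y) = E_g[w(X)]
W = w / w.sum()                        # self-normalized weights W_i
post_mean = np.sum(W * x)              # E[X | y] ≈ sum_i W_i x_i

# The exact posterior here is N(y/2, 1/2), so post_mean ≈ 0.5.
print(evidence, post_mean)
```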

Likelihood Weighting

Using the prior $p(x)$ as the proposal distribution $g(x)$, we can weight samples by their likelihood. This provides a softer approach compared to
Rejection Sampling
, which uses hard rejection/acceptance when sampling latent variables.
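Likelihood weighting falls out as the special case $g(x) = p(x)$, so the unnormalized weight reduces to the likelihood $p(y|x)$. A sketch under the same toy model as above (still an illustrative assumption):

```python
# Likelihood weighting: sample from the prior, weight by the likelihood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, y = 100_000, 1.0

x = stats.norm(0, 1).rvs(size=m, random_state=rng)   # samples from the prior p(x)
w = stats.norm(x, 1).pdf(y)                          # weight = likelihood p(y | x)

W = w / w.sum()
print(np.sum(W * x))   # posterior mean ≈ 0.5; no samples are rejected
```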
In
Importance sampling
practice, a good proposal must be close to the posterior, which might be quite different from the prior.
MCMC
algorithms draw samples from a target distribution by performing a biased
Random Walk
over the space of latent variables $x$.

Annealed Importance Sampling (AIS)

Instead of using a fixed proposal distribution, samples are drawn through a sequence of gradually changing distributions, transitioning to the target distribution through intermediate distributions.
MCMC
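A minimal AIS sketch, assuming a geometric path $\log p_\beta = (1-\beta)\log p_0 + \beta \log p_T$ between an easy start distribution and an unnormalized target, with one Metropolis move per intermediate distribution; all concrete choices here are illustrative assumptions.

```python
# Annealed Importance Sampling along a geometric path of distributions.
import numpy as np

rng = np.random.default_rng(0)

def log_p0(x):   # start distribution N(0, 1), easy to sample exactly
    return -0.5 * x ** 2

def log_pT(x):   # unnormalized target, here N(3, 1) without its constant
    return -0.5 * (x - 3.0) ** 2

def log_pk(x, b):  # intermediate distribution: (1-b)*log p0 + b*log pT
    return (1 - b) * log_p0(x) + b * log_pT(x)

betas = np.linspace(0.0, 1.0, 50)    # annealing schedule beta_0=0 ... beta_K=1
n = 5000
x = rng.normal(0.0, 1.0, size=n)     # exact samples from p0
logw = np.zeros(n)                   # accumulated log importance weights

for b_prev, b in zip(betas[:-1], betas[1:]):
    # Weight update: ratio of consecutive intermediate densities at current x.
    logw += log_pk(x, b) - log_pk(x, b_prev)
    # One Metropolis move to keep x approximately distributed as p_b.
    prop = x + rng.normal(0.0, 0.5, size=n)
    accept = np.log(rng.uniform(size=n)) < log_pk(prop, b) - log_pk(x, b)
    x = np.where(accept, prop, x)

# exp(logw) estimates Z_T / Z_0; both normalizers are sqrt(2*pi) here, so ≈ 1.
print(np.exp(logw).mean())
```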
Weight the data according to the probability of trajectory occurrence, which allows mathematically sound use of
Off-policy
data to utilize more data while weighting each sample by its importance.
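A bandit-style sketch of this off-policy reweighting, where samples collected under a behavior policy $b$ are weighted by $\pi(a)/b(a)$ to evaluate a target policy $\pi$; all concrete numbers are assumptions.

```python
# Off-policy evaluation via importance weights (two-action bandit toy).
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.8, 0.2])      # target policy pi(a) we want to evaluate
b = np.array([0.5, 0.5])       # behavior policy that generated the data
r_mean = np.array([1.0, 0.0])  # expected reward per action

# Data collected under the behavior policy.
a = rng.choice(2, size=100_000, p=b)
r = r_mean[a] + rng.normal(0, 0.1, size=a.size)

# Weight each sample by pi(a)/b(a) to estimate E_pi[r] from behavior data.
w = pi[a] / b[a]
print((w * r).mean())  # ≈ pi @ r_mean = 0.8
```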