LDA

Creator: Seonglae Cho
Created: 2023 Jun 11 8:58
Edited: 2025 Mar 12 12:41

Latent Dirichlet Allocation

Motivation

LDA is an algorithm that improves upon LSA and is suitable for topic modeling.
Let's assume we have a number of topics, each defined as a distribution over words. A document is generated through the following process: first, we choose a distribution over the topics; then, for each word position, we select a topic assignment and choose a word from that corresponding topic.

Method

  1. For each of the $K$ topics, draw a multinomial distribution $\beta_k$ from a Dirichlet distribution with parameter $\eta$, which controls the mean shape and sparsity of $\beta$.
  2. For each of the $D$ documents, draw a multinomial distribution $\theta_j$ from a Dirichlet distribution with parameter $\alpha$, which controls the mean shape and sparsity of $\theta$.
  3. For each word position $D_{ji}$ in a document $D_j$:
    1. Select a latent topic $z_{ji}$ from the multinomial distribution $\theta_j$.
    2. Choose the observation $w_{ji}$ from the multinomial distribution $\beta_{z_{ji}}$.
Each $\beta_k$ has $V$ parameters, where $V$ is the size of the vocabulary across all $D$ documents, while each $\theta_j$ has $K$ parameters.
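A minimal NumPy sketch of this generative process; the topic count `K`, vocabulary size `V`, document count `D`, document lengths, and the values of `alpha` and `eta` are all illustrative assumptions rather than anything fixed by LDA:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 3, 1000, 5             # topics, vocabulary size, documents (illustrative)
eta = np.full(V, 0.01)           # Dirichlet prior over words, controls sparsity of beta
alpha = np.full(K, 0.1)          # Dirichlet prior over topics, controls sparsity of theta
doc_lengths = [50, 80, 120, 60, 90]

# Step 1: draw a word distribution beta_k for each topic
beta = rng.dirichlet(eta, size=K)            # shape (K, V)

docs = []
for j in range(D):
    # Step 2: draw a topic distribution theta_j for each document
    theta = rng.dirichlet(alpha)             # shape (K,)
    words = []
    for i in range(doc_lengths[j]):
        # Step 3a: select a latent topic z_ji from theta_j
        z = rng.choice(K, p=theta)
        # Step 3b: choose the observed word w_ji from beta_{z_ji}
        w = rng.choice(V, p=beta[z])
        words.append(w)
    docs.append(words)
```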

Modeling

$$p(\mathbf{W}, \boldsymbol{\Theta}, \mathbf{B}, \mathbf{Z} \mid \boldsymbol{\alpha}, \boldsymbol{\eta}) = \prod_{k=1}^{K} p(\boldsymbol{\beta}_k \mid \boldsymbol{\eta}) \prod_{j=1}^{D} p(\boldsymbol{\theta}_j \mid \boldsymbol{\alpha}) \left( \prod_{i=1}^{N_j} p(z_{ji} \mid \boldsymbol{\theta}_j)\, p(w_{ji} \mid \mathbf{B}, z_{ji}) \right)$$
The posterior is intractable to compute exactly, because the denominator marginalizes over all possible topic assignments $\mathbf{Z}$ as well as $\boldsymbol{\Theta}$ and $\mathbf{B}$, so we approximate it:
$$p(\boldsymbol{\Theta}, \mathbf{B}, \mathbf{Z} \mid \mathbf{W}, \boldsymbol{\alpha}, \boldsymbol{\eta}) = \frac{p(\boldsymbol{\Theta}, \mathbf{B}, \mathbf{Z}, \mathbf{W} \mid \boldsymbol{\alpha}, \boldsymbol{\eta})}{\int_{\mathbf{B}} \int_{\boldsymbol{\Theta}} \sum_{\mathbf{Z}} p(\boldsymbol{\Theta}, \mathbf{B}, \mathbf{Z}, \mathbf{W} \mid \boldsymbol{\alpha}, \boldsymbol{\eta})}$$

Approximation using Gibbs sampling

  1. Initialize the topic assignments randomly or uniformly
  2. In each step, replace the value of one of the variables by a value drawn from the distribution of that variable conditioned on the values of the remaining variables
  3. Repeat until convergence
Estimate the probability of assigning $w_{ji}$ to each topic, conditioned on the topic assignments $\mathbf{z}_{j,-i}$ of all other words $\mathbf{w}_{j,-i}$ (the subscript $-i$ denotes the exclusion of word position $i$):
$$p(z_{ji} = k \mid \mathbf{z}_{j,-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\eta}) \propto \underbrace{\frac{n_{j,k,-i} + \alpha_k}{\sum_{k'=1}^{K} \left( n_{j,k',-i} + \alpha_{k'} \right)}}_{\text{Probability that document } j \text{ chooses topic } k,\; P(k \mid d_j)} \cdot \underbrace{\frac{m_{k,w_{ji},-i} + \eta_{w_{ji}}}{\sum_{\nu=1}^{V} \left( m_{k,\nu,-i} + \eta_{\nu} \right)}}_{\text{Probability that topic } k \text{ generates word } w_{ji},\; P(w_{ji} \mid k)}$$
From the above conditional distribution, sample a topic and set it as the new topic assignment $z_{ji}$ of word $w_{ji}$.
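A compact sketch of the collapsed Gibbs sampler implied by this update. It assumes `docs` is a list of documents, each a list of integer word ids, and uses symmetric scalar `alpha` and `eta` for simplicity; the function name and defaults are illustrative:

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    n = np.zeros((D, K))          # n[j, k]: words in document j assigned to topic k
    m = np.zeros((K, V))          # m[k, v]: times word v is assigned to topic k
    m_sum = np.zeros(K)           # total words assigned to each topic
    z = []                        # current topic assignment of every word position

    # random initialization of topic assignments
    for j, doc in enumerate(docs):
        z_j = rng.integers(K, size=len(doc))
        z.append(z_j)
        for w, k in zip(doc, z_j):
            n[j, k] += 1
            m[k, w] += 1
            m_sum[k] += 1

    for _ in range(iters):
        for j, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k_old = z[j][i]
                # remove word i from the counts (the "-i" in the update)
                n[j, k_old] -= 1
                m[k_old, w] -= 1
                m_sum[k_old] -= 1
                # p(z_ji = k | ...) ∝ P(k | d_j) * P(w_ji | k);
                # the document-side denominator is constant in k and cancels on normalization
                p = (n[j] + alpha) * (m[:, w] + eta) / (m_sum + V * eta)
                k_new = rng.choice(K, p=p / p.sum())
                # add the word back with its new assignment
                z[j][i] = k_new
                n[j, k_new] += 1
                m[k_new, w] += 1
                m_sum[k_new] += 1
    return z, n, m
```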

Comparison


Online example
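For a quick hands-on run, a minimal sketch using scikit-learn's `LatentDirichletAllocation`, which fits LDA with variational Bayes (`learning_method="online"` selects the online variant); the toy corpus below is made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets reacted to rates",
    "investors bought bonds and stocks",
]

X = CountVectorizer().fit_transform(corpus)        # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, learning_method="online", random_state=0)
theta = lda.fit_transform(X)                       # per-document topic proportions
print(theta.round(2))                              # each row sums to 1 across the 2 topics
```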

 
 

Recommendations