Latent Dirichlet Allocation
Motivation
LDA is an algorithm that improves upon LSA (Latent Semantic Analysis) and is well suited for topic modeling.
Let's assume we have a number of topics, each defined as a distribution over words. A document is generated through the following process: First, we choose a distribution over the topics. Then, for each word position, we select a topic assignment and choose a word from the corresponding topic.
Method
- For each of the K topics, draw a multinomial distribution β_k from a Dirichlet distribution with parameter η, which controls the mean shape and sparsity of β_k
- For each of the D documents, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α, which controls the mean shape and sparsity of θ_d
- For each word position n in document d
  - Select a latent topic z_{d,n} from the multinomial distribution θ_d
  - Choose the observed word w_{d,n} from the multinomial distribution β_{z_{d,n}}

Each topic distribution β_k has V parameters, where V is the size of the vocabulary across all documents.
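The generative process above can be sketched in a few lines of numpy. The sizes K, D, V, N and the hyperparameter values below are illustrative assumptions, not prescribed by LDA itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: K topics, D documents, vocabulary V, N words per document.
K, D, V, N = 3, 5, 20, 30
alpha, eta = 0.5, 0.1  # Dirichlet hyperparameters (illustrative values)

# Draw each topic's word distribution beta_k ~ Dirichlet(eta).
beta = rng.dirichlet(np.full(V, eta), size=K)  # shape (K, V)

docs = []
for d in range(D):
    # Draw the document's topic proportions theta_d ~ Dirichlet(alpha).
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)      # latent topic assignment z_{d,n}
        w = rng.choice(V, p=beta[z])    # observed word w_{d,n}
        words.append(w)
    docs.append(words)
```

Small η and α push the sampled distributions toward sparsity, so each topic concentrates on a few words and each document on a few topics.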
Modeling
The posterior p(θ, β, z | w) is intractable to compute exactly, because the normalizing constant p(w) requires summing over every possible topic assignment, so we approximate it.
Approximation using Gibbs sampling
- Initialize the topic assignments randomly or uniformly
- In each step, replace the value of one of the variables by a value drawn from the distribution of that variable conditioned on the values of the remaining variables
- Repeat until convergence
Estimate the probability of assigning word w_{d,n} to each topic k, conditioned on the topic assignments z_{−(d,n)} of all other words (the subscript −(d,n) indicates exclusion of position (d,n)):

p(z_{d,n} = k | z_{−(d,n)}, w) ∝ (n_{d,k} + α) · (n_{k,w_{d,n}} + η) / (n_k + V·η)

where the counts n_{d,k} (words in document d assigned to topic k), n_{k,w} (times word w is assigned to topic k), and n_k (total words assigned to topic k) all exclude the current position.
From the above conditional distribution, sample a topic and set it as the new topic assignment of w_{d,n}.
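The collapsed Gibbs sampling loop can be sketched as follows. The toy corpus, number of topics, sweep count, and hyperparameter values are assumptions for illustration; counts are updated incrementally so each resampling step conditions on all other assignments:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy corpus: documents as lists of word ids.
docs = [[0, 1, 2, 1], [2, 3, 3, 0], [1, 1, 4, 2]]
K, V = 2, 5
alpha, eta = 0.5, 0.1

# Count matrices: document-topic, topic-word, and per-topic totals.
n_dk = np.zeros((len(docs), K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)

# Random initialization of the topic assignments z.
z = []
for d, doc in enumerate(docs):
    z_d = []
    for w in doc:
        k = rng.integers(K)
        z_d.append(k)
        n_dk[d, k] += 1
        n_kw[k, w] += 1
        n_k[k] += 1
    z.append(z_d)

# Gibbs sweeps: resample each assignment conditioned on all the others.
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the current assignment from the counts.
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # Conditional p(z = k | z_-, w), up to a normalizing constant.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
            k = rng.choice(K, p=p / p.sum())
            # Record the new assignment and add it back to the counts.
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
```

After convergence, θ and β can be estimated from the counts, e.g. θ_{d,k} ≈ (n_{d,k} + α) / (N_d + K·α).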