Latent Semantic Analysis
Applies low-rank matrix factorization, typically truncated SVD, to the term-document matrix that records the frequency of each term in each document: X ≈ U Σ Vᵀ.
- U: each topic’s distribution over terms
- Σ: diagonal matrix of singular values, can be seen as topic importances / weights
- Vᵀ: each document’s distribution over topics
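The factorization above can be sketched with NumPy; the toy term-document counts and the choice of k = 2 topics are illustrative assumptions, not from the original notes:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (illustrative counts)
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 3.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

k = 2  # number of latent topics to keep (assumed)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate to the k largest singular values
U_k = U[:, :k]            # term-topic loadings
S_k = np.diag(s[:k])      # topic importances / weights
Vt_k = Vt[:k, :]          # topic-document loadings

# Rank-k approximation of the original matrix
X_k = U_k @ S_k @ Vt_k
print(np.round(X_k, 2))
```

Keeping only the top k singular values gives the best rank-k approximation of X in the least-squares sense, which is what "latent semantic" structure refers to here.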
Cons
- SVD has a significant computational cost
- No intuition about the origin of the topics
pLSA (Probabilistic LSA)
The topic distribution that characterizes a document in our collection determines which words should exist in it.
In a document d, a word w is generated from a single topic z out of the K assumed topics, and given that topic, the word is independent of all the other words in the document.
We derive the model by writing the joint distribution of d and w (a single word in the document) via conditional probability based on Bayes’ theorem, then extending it to the joint distribution of d and all words in the document.
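Written out (a standard reconstruction of the pLSA factorization; the symbols z for the topic variable and K for the number of topics are assumptions):

```latex
P(d, w) = P(d)\, P(w \mid d), \qquad
P(w \mid d) = \sum_{z=1}^{K} P(w \mid z)\, P(z \mid d)
```

For all words w_1, …, w_N of the document, conditional independence given the topic gives:

```latex
P(d, w_1, \dots, w_N) = P(d) \prod_{n=1}^{N} \sum_{z=1}^{K} P(w_n \mid z)\, P(z \mid d)
```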
Solution
EM algorithm (probabilistic, unlike the deterministic SVD of the traditional LSA approach)
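A minimal EM sketch for pLSA, assuming the toy counts, K = 2 topics, and 50 iterations below (all illustrative choices): the E-step computes the posterior P(z|d,w), and the M-step re-estimates P(w|z) and P(z|d) from expected counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-word count matrix n[d, w] (illustrative)
n = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 3.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
])
D, W = n.shape
K = 2  # assumed number of topics

# Random initialization; each row normalized to a probability distribution
p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z|d)
p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)

for _ in range(50):
    # E-step: posterior P(z|d,w) ∝ P(z|d) P(w|z), shape (D, K, W)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]
    post /= post.sum(axis=1, keepdims=True) + 1e-12

    # M-step: expected counts n(d,w) P(z|d,w), then renormalize
    nz = n[:, None, :] * post
    p_w_z = nz.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # P(w|z) ∝ Σ_d n(d,w) P(z|d,w)
    p_z_d = nz.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # P(z|d) ∝ Σ_w n(d,w) P(z|d,w)

print(np.round(p_z_d, 2))
```

Note that p_z_d has one row per training document, which is exactly the linear parameter growth listed under Cons, and why a new document needs EM re-run with P(w|z) fixed (fold-in).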
Cons
- The number of parameters grows linearly with the number of documents
- To handle a new, unseen document, EM must be re-run (pLSA is not a fully generative model over documents)