Latent Semantic Analysis
Apply a simple matrix-factorization technique, SVD, to the term-document matrix representing the frequency of terms in documents: X ≈ U Σ Vᵀ.
- U: each topic’s distribution over terms
- Σ: diagonal matrix of singular values, can be seen as topic importances / weights
- Vᵀ: each document’s distribution over topics
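A minimal sketch of truncated SVD on a toy term-document matrix (the counts and vocabulary are made up for illustration):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (hypothetical counts)
X = np.array([
    [2, 0, 1, 0],   # "car"
    [1, 0, 2, 0],   # "engine"
    [0, 3, 0, 1],   # "fruit"
    [0, 2, 0, 2],   # "apple"
], dtype=float)

# Full SVD: X = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top-k singular values -> k latent "topics"
k = 2
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# U_k: term-topic weights, S_k: topic importances, Vt_k: topic-document weights
X_k = U_k @ np.diag(S_k) @ Vt_k   # rank-k approximation of X
print(np.round(X_k, 2))
```

Keeping only the top k singular values gives the rank-k approximation that defines the k-dimensional latent topic space.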
Cons
- SVD has a significant computational cost
- No intuition about where the topics come from (no generative interpretation)
pLSA (Probabilistic LSA)
The topic distribution that characterizes a document in our collection determines which words should exist in it.
In a document d, a word w is generated from a single topic z among the assumed topics, and given that topic, the word is independent of all the other words in the document.
We derive this by transforming the joint distribution for d and w (a single word in the document), P(d, w) = P(d) Σ_z P(z|d) P(w|z), using the conditional probability distribution based on Bayes’ theorem → the joint distribution for d and all the words in the document.
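Under this factorization, the joint probability of a (document, word) pair is P(d) Σ_z P(z|d) P(w|z). A small sketch with randomly generated (purely hypothetical) parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 3, 5, 2

# Hypothetical pLSA parameters (each row is a probability distribution)
P_d = np.full(n_docs, 1 / n_docs)                       # P(d)
P_z_given_d = rng.dirichlet(np.ones(n_topics), n_docs)  # P(z|d)
P_w_given_z = rng.dirichlet(np.ones(n_words), n_topics) # P(w|z)

# Joint distribution of one (document, word) pair:
# P(d, w) = P(d) * sum_z P(z|d) * P(w|z)
P_dw = P_d[:, None] * (P_z_given_d @ P_w_given_z)
print(P_dw.sum())  # sums to 1 over all (d, w) pairs
```

The matrix product over the topic index z is exactly the marginalization over the single hidden topic that generated the word.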

Solution
EM algorithm (an iterative probabilistic fit, unlike the deterministic SVD of the traditional LSA approach)
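The EM fit can be sketched in NumPy (a toy sketch under made-up counts and topic number K, not a faithful reproduction of any reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy word-count matrix n[d, w]: rows = documents, columns = words (hypothetical)
n = np.array([[4, 2, 0, 0],
              [3, 3, 1, 0],
              [0, 1, 4, 3],
              [0, 0, 2, 5]], dtype=float)
D, W, K = n.shape[0], n.shape[1], 2

# Randomly initialised parameters (each row is a probability distribution)
P_z_d = rng.dirichlet(np.ones(K), D)  # P(z|d), shape (D, K)
P_w_z = rng.dirichlet(np.ones(W), K)  # P(w|z), shape (K, W)

for _ in range(100):
    # E-step: posterior over topics, P(z|d,w) ∝ P(z|d) P(w|z)
    post = P_z_d[:, :, None] * P_w_z[None, :, :]        # shape (D, K, W)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from expected counts
    exp_counts = n[:, None, :] * post                   # shape (D, K, W)
    P_w_z = exp_counts.sum(axis=0)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = exp_counts.sum(axis=2)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)
```

Note that P(z|d) is a per-document parameter table: this is exactly why the parameter count grows with the number of documents, as listed in the cons below.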
Cons
- The number of parameters grows linearly with the number of documents
- To deal with a new document, EM must be repeated
Latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.[1]
https://en.wikipedia.org/wiki/Latent_semantic_analysis
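The cosine comparison described in the excerpt above can be sketched as follows (toy matrix; the first two documents share terms while the third does not):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two document vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Term-document count matrix (rows = words, columns = documents)
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 0, 2],
              [0, 0, 1]], dtype=float)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
docs = (np.diag(S[:2]) @ Vt[:2]).T   # documents as rows in the 2-D latent space

print(cosine(docs[0], docs[1]))  # ≈ 1: very similar documents
print(cosine(docs[0], docs[2]))  # ≈ 0: very dissimilar documents
```

Comparing the SVD-reduced columns rather than the raw counts is what lets LSA match documents that use related but not identical vocabulary.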
Probabilistic latent semantic analysis
Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles) is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.
https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

Seonglae Cho
