LSA

Creator: Seonglae Cho
Created: 2024 Feb 24 2:12
Edited: 2025 Mar 12 12:19
Refs: LDA

Latent Semantic Analysis

Use a simple latent factorization technique such as SVD on the term-document matrix $X$, which records the frequency of $N$ terms across $D$ documents.
$$X \approx W_K \Sigma_K C_K$$
  • $W_K$: each of the $K$ topics' distribution over the $N$ terms
  • $\Sigma_K$: diagonal matrix of singular values, which can be read as topic importance / weight
  • $C_K$: each of the $D$ documents' distribution over the $K$ topics
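
As a rough sketch, this decomposition can be obtained with scikit-learn's TruncatedSVD on a small document-term count matrix; the toy corpus, the choice of K, and the variable names below are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus: D documents over a vocabulary of N terms (illustrative only)
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
]

# Build the D x N document-term matrix of raw frequencies
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # shape (D, N)

K = 2                                # number of latent topics (assumed)
svd = TruncatedSVD(n_components=K, random_state=0)
C = svd.fit_transform(X)             # (D, K): document-topic weights, already scaled by the singular values
W = svd.components_                  # (K, N): topic-term weights, rows correspond to W_K's columns
sigma = svd.singular_values_         # (K,):   Sigma_K, one importance weight per topic

print(sigma)
print(W.round(2))
```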

Cons

  • SVD has a significant computational cost
  • No intuition about the origin of the topics

pLSA (Probabilistic LSA)

The topic distribution that characterizes a document in our collection determines which words should exist in it.
In a document $d_j$, each word $w_{ji}$ is generated from a single topic $z_{ji}$ out of the $K$ assumed topics, and given that topic, the word is independent of all the other words in the document.
$$P(D, W) = \prod_{j=1}^{D} P(d_j) \prod_{i=1}^{N_j} \sum_{k=1}^{K} P(z_{ji} = k \mid d_j)\, P(w_{ji} \mid z_{ji} = k)$$
We derive this by starting from the joint distribution of $d_j$ and $w_i$ (a single word in the document), rewriting it as a conditional probability via Bayes' theorem, and then extending it to the joint distribution of $d_j$ and $w_j$ (all words in the document).
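For illustration, a minimal numeric sketch of evaluating this likelihood with made-up $P(z \mid d)$ and $P(w \mid z)$ tables and a tiny corpus; every value below is an assumption, not a fitted parameter.

```python
import numpy as np

# Toy setup: D=2 documents, K=2 topics, vocabulary of 3 words (all values illustrative)
P_d = np.array([0.5, 0.5])                    # P(d_j)
P_z_given_d = np.array([[0.9, 0.1],           # P(z = k | d_j), shape (D, K)
                        [0.2, 0.8]])
P_w_given_z = np.array([[0.7, 0.2, 0.1],      # P(w | z = k), shape (K, V)
                        [0.1, 0.3, 0.6]])

# Each document is a list of word indices into the vocabulary
docs = [[0, 0, 1], [2, 2, 1]]

# log P(D, W) = sum_j [ log P(d_j) + sum_i log sum_k P(z=k | d_j) P(w_ji | z=k) ]
log_likelihood = 0.0
for j, words in enumerate(docs):
    log_likelihood += np.log(P_d[j])
    for w in words:
        p_w = np.sum(P_z_given_d[j] * P_w_given_z[:, w])   # mixture over topics
        log_likelihood += np.log(p_w)

print(log_likelihood)
```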

Solution

EM Algorithm
(a probabilistic fit, unlike the deterministic SVD used in the traditional LSA approach)
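
A compact sketch of tabular EM for pLSA on a toy word-count matrix; the counts, random initialization, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# n[j, w]: count of word w in document j (toy counts, illustrative)
n = np.array([[3, 1, 0, 0],
              [2, 2, 1, 0],
              [0, 0, 3, 2],
              [0, 1, 2, 3]], dtype=float)
D, V = n.shape
K = 2

# Random initialization of P(z|d) and P(w|z), normalized into distributions
P_z_d = rng.random((D, K)); P_z_d /= P_z_d.sum(axis=1, keepdims=True)
P_w_z = rng.random((K, V)); P_w_z /= P_w_z.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: responsibilities P(z | d, w), proportional to P(z|d) P(w|z)
    joint = P_z_d[:, :, None] * P_w_z[None, :, :]        # (D, K, V)
    P_z_dw = joint / joint.sum(axis=1, keepdims=True)    # normalize over z

    # M-step: re-estimate P(w|z) and P(z|d) from expected topic-word counts
    expected = n[:, None, :] * P_z_dw                    # (D, K, V)
    P_w_z = expected.sum(axis=0)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = expected.sum(axis=2)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)

print(P_z_d.round(2))
print(P_w_z.round(2))
```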

Cons

  • The number of parameters grows linearly with the number of documents
  • Handling a new document requires rerunning EM
