LSA

Created
2024 Feb 24 2:12
Creator
Seonglae Cho
Edited
2025 Mar 12 12:19
Refs
LDA

Latent Semantic Analysis

Apply a simple latent factorization technique such as
SVD
to the term-document matrix representing the frequency of terms in documents, X ≈ UΣVᵀ:
  • U: each topic’s distribution over terms
  • Σ: diagonal matrix, which can be seen as topic importances / weights
  • Vᵀ: each document’s distribution over topics
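As a minimal sketch, truncated SVD on a toy term-document matrix (the terms, documents, and counts below are invented for illustration):

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents
X = np.array([
    [2, 1, 0, 0],  # "ship"
    [1, 2, 0, 0],  # "boat"
    [0, 0, 2, 1],  # "tree"
    [0, 0, 1, 2],  # "leaf"
], dtype=float)

# Full SVD, then keep only the k largest singular values (k latent topics)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# U_k: topics as distributions over terms
# s_k: topic importances; Vt_k: documents as mixtures of topics
X_k = U_k @ np.diag(s_k) @ Vt_k  # best rank-k reconstruction of X
print(np.round(X_k, 2))
```

The rank-k reconstruction keeps the dominant "topic" structure (here, the nautical block and the botanical block) while discarding the smaller singular values.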

Cons

  • SVD has a significant computational cost
  • No intuition about the origin of the topics

pLSA (Probabilistic LSA)

The topic distribution that characterizes a document in our collection determines which words should appear in it.
In a document d, each word w is generated from a single topic z among the assumed topics, and given that topic, the word is independent of all the other words in the document. We derive this by transforming the
Joint Distribution
for d and w (a single word in the document) using the
Conditional probability
distribution based on
Bayes Theorem
→ the joint distribution for d and all words in the document
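In standard pLSA notation (document d, word w, latent topic z), the single-word decomposition is:

```latex
P(d, w) = P(d)\, P(w \mid d) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)
```

and, since words are conditionally independent given their topics, the joint distribution over all words of a document with counts n(d, w) factorizes as a product of these per-word terms.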

Solution

EM Algorithm
(probabilistic, unlike the deterministic SVD of the traditional LSA approach)
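A minimal EM sketch for pLSA; the toy counts and the number of topics K are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy count matrix n(d, w): rows = documents, columns = words
N = np.array([
    [4, 2, 0, 0],
    [3, 3, 0, 1],
    [0, 0, 5, 2],
    [0, 1, 3, 4],
], dtype=float)
D, W = N.shape
K = 2  # number of latent topics (assumed)

# Random initialization of P(z|d) and P(w|z), normalized to distributions
Pz_d = rng.random((D, K)); Pz_d /= Pz_d.sum(1, keepdims=True)
Pw_z = rng.random((K, W)); Pw_z /= Pw_z.sum(1, keepdims=True)

for _ in range(100):
    # E-step: posterior over topics, P(z|d,w) ∝ P(z|d) P(w|z)
    Pz_dw = Pz_d[:, :, None] * Pw_z[None, :, :]   # shape (D, K, W)
    Pz_dw /= Pz_dw.sum(1, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w)
    C = N[:, None, :] * Pz_dw                     # shape (D, K, W)
    Pw_z = C.sum(0) / (C.sum((0, 2))[:, None] + 1e-12)
    Pz_d = C.sum(2) / (N.sum(1, keepdims=True) + 1e-12)

print(np.round(Pz_d, 2))  # each document's mixture over topics
```

Note that P(z|d) is estimated separately for every document, which is exactly why the parameter count grows with the corpus and a new document forces another EM run.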

Cons

  • The number of parameters grows linearly with the number of documents
  • Handling a new document requires rerunning EM
 
 
Latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.[1]
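For example, comparing document columns by cosine similarity (the vectors below are made-up reduced representations for illustration):

```python
import numpy as np

# Columns of a (reduced) term-document matrix; hypothetical toy values
doc_a = np.array([1.5, 1.5, 0.0, 0.0])
doc_b = np.array([1.4, 1.6, 0.0, 0.1])
doc_c = np.array([0.0, 0.1, 1.5, 1.5])

def cosine(u, v):
    """Cosine similarity: near 1 = very similar, near 0 = dissimilar."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(doc_a, doc_b), 3))  # close to 1: similar documents
print(round(cosine(doc_a, doc_c), 3))  # close to 0: dissimilar documents
```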
Probabilistic latent semantic analysis
Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles) is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.
 
 
