LSA

Created
2024 Feb 24 2:12
Creator
Seonglae Cho
Edited
2025 Mar 12 12:19
Refs
LDA

Latent Semantic Analysis

Apply a simple latent factorization technique such as
SVD
to the term-document matrix representing the frequency of terms in documents, X ≈ UΣVᵀ:
  • U: each topic’s distribution over terms
  • Σ: diagonal matrix, which can be seen as topic importances / weights
  • Vᵀ: each document’s distribution over topics
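As a minimal sketch, truncated SVD on a toy term-document matrix (the terms, documents, and counts below are invented for illustration):

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents
X = np.array([
    [2, 1, 0, 0],  # "ship"
    [1, 2, 0, 0],  # "boat"
    [0, 0, 2, 1],  # "tree"
    [0, 0, 1, 2],  # "leaf"
], dtype=float)

# Full SVD, then keep only the k largest singular values (k latent topics)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# U_k: topics as distributions over terms
# s_k: topic importances; Vt_k: documents as mixtures of topics
X_k = U_k @ np.diag(s_k) @ Vt_k  # best rank-k reconstruction of X
print(np.round(X_k, 2))
```

The rank-k reconstruction keeps the dominant "topic" structure (here, the nautical block and the botanical block) while discarding the smaller singular values.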

Cons

  • SVD has a significant computational cost
  • No intuition about the origin of the topics

pLSA (Probabilistic LSA)

The topic distribution that characterizes a document in our collection determines which words should appear in it.
In a document d, each word w is generated from a single topic z among the assumed topics, and given that topic, the word is independent of all the other words in the document. We derive this by transforming the
Joint Distribution
for d and w (a single word in the document) using the
Conditional probability
distribution based on
Bayes Theorem
→ the joint distribution for d and all words in the document
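In standard pLSA notation (document d, word w, latent topic z), the single-word decomposition is:

```latex
P(d, w) = P(d)\, P(w \mid d) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)
```

and, since words are conditionally independent given their topics, the joint distribution over all words of a document with counts n(d, w) factorizes as a product of these per-word terms.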

Solution

EM Algorithm
(probabilistic, unlike the deterministic SVD of the traditional LSA approach)
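A minimal EM sketch for pLSA; the toy counts and the number of topics K are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy count matrix n(d, w): rows = documents, columns = words
N = np.array([
    [4, 2, 0, 0],
    [3, 3, 0, 1],
    [0, 0, 5, 2],
    [0, 1, 3, 4],
], dtype=float)
D, W = N.shape
K = 2  # number of latent topics (assumed)

# Random initialization of P(z|d) and P(w|z), normalized to distributions
Pz_d = rng.random((D, K)); Pz_d /= Pz_d.sum(1, keepdims=True)
Pw_z = rng.random((K, W)); Pw_z /= Pw_z.sum(1, keepdims=True)

for _ in range(100):
    # E-step: posterior over topics, P(z|d,w) ∝ P(z|d) P(w|z)
    Pz_dw = Pz_d[:, :, None] * Pw_z[None, :, :]   # shape (D, K, W)
    Pz_dw /= Pz_dw.sum(1, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w)
    C = N[:, None, :] * Pz_dw                     # shape (D, K, W)
    Pw_z = C.sum(0) / (C.sum((0, 2))[:, None] + 1e-12)
    Pz_d = C.sum(2) / (N.sum(1, keepdims=True) + 1e-12)

print(np.round(Pz_d, 2))  # each document's mixture over topics
```

Note that P(z|d) is estimated separately for every document, which is exactly why the parameter count grows with the corpus and a new document forces another EM run.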

Cons

  • The number of parameters grows linearly with the number of documents
  • Handling a new document requires rerunning EM
 
 
Latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.[1]
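For example, comparing document columns by cosine similarity (the vectors below are made-up reduced representations for illustration):

```python
import numpy as np

# Columns of a (reduced) term-document matrix; hypothetical toy values
doc_a = np.array([1.5, 1.5, 0.0, 0.0])
doc_b = np.array([1.4, 1.6, 0.0, 0.1])
doc_c = np.array([0.0, 0.1, 1.5, 1.5])

def cosine(u, v):
    """Cosine similarity: near 1 = very similar, near 0 = dissimilar."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(doc_a, doc_b), 3))  # close to 1: similar documents
print(round(cosine(doc_a, doc_c), 3))  # close to 0: dissimilar documents
```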
Probabilistic latent semantic analysis
Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles) is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.
 
 
