Zipfian distribution

Creator
Seonglae Cho
Created
2025 Jan 22 11:0
Editor
Edited
2025 Jan 22 11:59

Power law that

A dataset that exhibits scale invariance and self-similarity, driven by a power-law relationship.
A few words occur very often, and many words hardly ever occur.
Word frequency is inversely proportional to rank. For example, the most frequent word appears approximately twice as often as the second most frequent word.
$$f(r) \propto \frac{1}{r}$$
where $f(r)$ is the frequency and $r$ is the rank. More generally, for a set of $N$ elements and a parameter $s \ge 0$, the normalized frequency of the element with rank $k$ is
$$f(k; s, N) = \frac{k^{-s}}{\sum_{i=1}^{N} i^{-s}}$$
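As a minimal sketch of the normalized form, the snippet below evaluates f(k; s, N) for every rank k = 1..N in plain Python. The function name `zipf_pmf` and its argument names are illustrative choices, not part of the original note.

```python
def zipf_pmf(n, s=1.0):
    """Return [f(1; s, n), ..., f(n; s, n)] for the normalized Zipf form.

    `zipf_pmf`, `n`, and `s` are hypothetical names used for illustration.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if s < 0:
        raise ValueError("s must be non-negative")
    # Unnormalized weight of rank k is k^(-s).
    weights = [k ** -s for k in range(1, n + 1)]
    # Dividing by the sum of all weights makes the values sum to 1.
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    probs = zipf_pmf(10, s=1.0)
    # With s = 1, rank 1 is twice as likely as rank 2.
    print(probs[0] / probs[1])
```

With s = 1 this reproduces the 1/r proportionality, and with s = 0 every rank gets the same probability 1/N.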
[Figure]
Empirical word frequencies follow the theoretical curve quite well, but not entirely.
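One rough way to check how closely a token list follows the law is to sort the word counts by rank and fit a straight line to log-frequency against log-rank. The negative of the slope is then an estimate of s. The sketch below assumes a pre-tokenized list of words. The helper names `rank_frequency` and `estimate_exponent` are hypothetical, and an ordinary least-squares fit on log-log data is only a crude estimator.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def estimate_exponent(freqs):
    """Estimate s by least squares on log(frequency) versus log(rank).

    Assumes `freqs` is sorted by rank and contains at least two positive values.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    # A power law f = C * r^(-s) is a line with slope -s in log-log space.
    return -cov / var


if __name__ == "__main__":
    tokens = "the cat and the dog and the bird".split()
    freqs = rank_frequency(tokens)
    print(freqs, estimate_exponent(freqs))
```

An estimate near 1 suggests the classic 1/r behavior. Real corpora usually deviate at the highest and lowest ranks.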

What will happen to this plot if we remove stop words from our vocabulary?

 
 
 
 
