Zipfian distribution

Power law that

Dataset exhibiting scale-invariance and self-similarity driven by power-law relationship

a few words occur very often, and many words hardly ever occur

Word frequency is inversely proportional to rank. For example, the most frequent word appears approximately twice as often as the second most frequent word

where is means frequency and indicates the rank in a set of variables controlled by parameter

Practice follows theory quite well, but not entirely.

Zipfian distribution

Power law that

What will happen to this plot if we remove stop words from our vocabulary?

Recommendations