Informational Entropy
A characteristic value that summarizes the shape of a probability distribution and its amount of information. It measures information content: the number of bits actually required to store the data, how random the distribution is, and how broad it is (a uniform distribution has maximum entropy).
The information content of a message is a function of how predictable it is. The information content (number of bits) needed to encode an event $i$ with probability $P(i)$ is $I(i) = -\log_2 P(i)$. So a next-token-prediction probability directly carries its own information content.
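A minimal sketch of this point (the token probabilities below are made-up illustrative values, not from the note): the surprisal of a predicted token drops as its probability rises.

```python
import math

def information_content(p: float) -> float:
    """Bits needed to encode an event with probability p: I = -log2(p)."""
    return -math.log2(p)

# Hypothetical next-token probabilities: a confident prediction carries
# little information; an unlikely token carries a lot.
for token, p in [("the", 0.5), ("cat", 0.05), ("zygote", 0.001)]:
    print(f"P({token!r}) = {p:<6} -> {information_content(p):.2f} bits")
```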
The entropy of a message is the expected number of bits needed to encode it (Shannon entropy).
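Written out in standard notation (not present in the original note), the expected-bits definition is the average of the per-event information content above:

$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) = \mathbb{E}\left[-\log_2 P(X)\right]$$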
Average Information Content
The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability.
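A short sketch of this interpretation (the distributions are arbitrary illustrative examples): a uniform distribution over four outcomes is maximally unpredictable, while a sharply peaked one is nearly certain.

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximum uncertainty over 4 outcomes
peaked  = [0.97, 0.01, 0.01, 0.01]   # almost deterministic

print(entropy(uniform))  # 2.0 bits (log2 of 4, the maximum for 4 outcomes)
print(entropy(peaked))   # ~0.24 bits, far more predictable
```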
Information Content of Individual Events
Information content must be additive: for independent events, the total information content equals the sum of the individual events' information content. This requirement is what forces the logarithmic form, since $P(x, y) = P(x)P(y)$ implies $-\log_2 P(x, y) = -\log_2 P(x) - \log_2 P(y)$. A quick numerical check is sketched below.
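A numerical check under assumed independence (the probabilities are arbitrary illustrative values):

```python
import math

p_x, p_y = 0.5, 0.125            # arbitrary probabilities of two independent events
p_joint = p_x * p_y              # independence: P(x, y) = P(x) * P(y)

i_x = -math.log2(p_x)            # 1 bit
i_y = -math.log2(p_y)            # 3 bits
i_joint = -math.log2(p_joint)    # 4 bits

assert math.isclose(i_joint, i_x + i_y)  # additivity: I(x, y) = I(x) + I(y)
```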
Boltzmann–Gibbs entropy in statistical mechanics equals Shannon entropy in information theory, up to a constant scaling factor.
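Spelling out the constant (standard forms of the two definitions; $k_B$ is the Boltzmann constant, and the $\ln 2$ converts between natural log and bits):

$$S = -k_B \sum_i p_i \ln p_i = (k_B \ln 2)\, H(X), \qquad H(X) = -\sum_i p_i \log_2 p_i$$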