Information Entropy
A characteristic value that represents the shape of a probability distribution and its amount of information. It measures information content: the number of bits actually required to store the data, how random the distribution is, and how broad it is (the uniform distribution has maximum entropy).
The information content of a message is a function of how predictable it is. The information content (number of bits) needed to encode an event i with probability p_i is I(i) = -log2(p_i). So a next-token-prediction probability itself carries information content.
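A minimal sketch of the formula above: a certain event carries zero bits, while rarer events carry more. The function name `information_content` is my own choice for illustration.

```python
import math

def information_content(p: float) -> float:
    """Bits needed to encode an event with probability p: I = -log2(p)."""
    return -math.log2(p)

# A certain event carries no information; rarer events carry more.
print(information_content(1.0))    # 0.0 bits
print(information_content(0.5))    # 1.0 bit
print(information_content(0.125))  # 3.0 bits
```

Note how halving the probability adds exactly one bit, which is why the logarithm (rather than, say, 1/p) is the right measure.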
The entropy of a message is the expected number of bits needed to encode it: H = -sum_i p_i * log2(p_i). (Shannon entropy)
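The expectation above can be computed directly; this is a small sketch (the `entropy` helper name is mine), showing that a fair coin costs 1 bit per toss on average while a biased coin costs less.

```python
import math

def entropy(probs) -> float:
    """Shannon entropy H = -sum(p * log2 p); zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: fair coin
print(entropy([0.9, 0.1]))  # ~0.47 bits: biased coin is more predictable
```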
Average Information Content
The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability.
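The uncertainty interpretation can be checked numerically: over n outcomes, the uniform distribution is the least predictable and attains the maximum entropy log2(n), while a peaked distribution scores far lower. A sketch (helper name `entropy` is my own):

```python
import math

def entropy(probs) -> float:
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform over 4 outcomes: maximum entropy, log2(4) = 2 bits.
uniform = [0.25, 0.25, 0.25, 0.25]
# Highly peaked: almost always the same outcome, so little uncertainty.
peaked = [0.97, 0.01, 0.01, 0.01]

print(entropy(uniform))  # 2.0
print(entropy(peaked))   # well under 1 bit
```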
Information Content of Individual Events
Information content must be additive: for independent events, the information content of observing them jointly must equal the sum of their individual information contents. Since independent probabilities multiply, p(x, y) = p(x) * p(y), the logarithm is the natural choice, turning that product into a sum.