Montok
When tokenizing sentences with the same meaning across different languages, the number of tokens required varies significantly. Training 97 monolingual tokenizers under identical conditions showed that differences in whitespace usage between languages are a major contributing factor to this gap.
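This gap is usually expressed as a token premium: the token count of a sentence relative to its English counterpart. The study trains separate monolingual tokenizers; the sketch below is only an illustration of the measurement itself, reusing a single off-the-shelf multilingual tokenizer (xlm-roberta-base is an arbitrary choice) on hypothetical parallel sentences.

```python
# Illustration only: the study trains 97 monolingual tokenizers, whereas this sketch
# reuses one off-the-shelf multilingual tokenizer just to show how a token premium
# relative to English can be computed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Hypothetical parallel sentences with roughly the same meaning.
parallel = {
    "en": "The weather is nice today.",
    "de": "Das Wetter ist heute schön.",
    "th": "วันนี้อากาศดี",            # Thai: written without spaces between words
    "ja": "今日は天気がいいです。",    # Japanese: no whitespace word boundaries
}

counts = {lang: len(tokenizer.tokenize(text)) for lang, text in parallel.items()}
baseline = counts["en"]
for lang, n in counts.items():
    # Token premium: how many tokens a language needs relative to English.
    print(f"{lang}: {n} tokens, premium = {n / baseline:.2f}x")
```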
SuperBPE, which lifts the whitespace pre-tokenization constraint so that merges can cross word boundaries, improves overall compression and significantly reduces inequality between languages. However, differences stemming from language-specific characteristics, such as UTF-8 encoding, remain a challenge for future work.
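A minimal sketch of the whitespace effect, using the Hugging Face tokenizers library rather than the actual SuperBPE implementation (which trains in stages): the only difference between the two tokenizers below is whether pre-tokenization splits on whitespace, i.e. whether merges may produce tokens that span multiple words. The toy corpus and vocabulary size are placeholders.

```python
# Not the SuperBPE implementation, just the core idea: train two BPE tokenizers on the
# same toy corpus, one that pre-splits on whitespace (merges never cross word boundaries)
# and one without pre-tokenization (merges may span spaces, producing multi-word tokens).
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus = [
    "tokenization costs differ across languages",
    "the same sentence can need far more tokens in another language",
]  # placeholder corpus; real experiments use large monolingual text

def train_bpe(split_on_whitespace: bool) -> Tokenizer:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    if split_on_whitespace:
        tok.pre_tokenizer = pre_tokenizers.Whitespace()  # conventional word-bounded BPE
    trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

word_bounded = train_bpe(split_on_whitespace=True)
superword = train_bpe(split_on_whitespace=False)

sentence = "tokenization costs differ across languages"
print(len(word_bounded.encode(sentence).tokens))  # merges stop at word boundaries
print(len(superword.encode(sentence).tokens))     # whole phrases can become single tokens
```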
In general, English is the most efficient (lowest token premium), while Southeast Asian, South Asian, and languages written without whitespace pay the largest premiums.

Seonglae Cho