Multilingual Tokenization

Creator: Seonglae Cho
Created: 2025 Oct 29 1:13
Edited: 2025 Nov 1 16:09

Montok

When tokenizing sentences with the same meaning across different languages, the number of tokens required varies significantly. Training 97 monolingual tokenizers under identical conditions revealed that differences in whitespace usage between languages are a major contributing factor.
Using SuperBPE, which ignores whitespace boundaries and allows merges that span words, improves overall compression and significantly reduces the inequality between languages. However, differences stemming from language-specific characteristics such as UTF-8 encoding remain an open challenge.
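The effect of lifting the whitespace boundary can be seen by training two tiny BPE tokenizers on the same corpus, one with whitespace pre-tokenization and one without. This is only a toy contrast under assumed settings (the corpus, vocabulary size, and the Hugging Face tokenizers library are illustrative choices), not the actual SuperBPE training procedure; it just shows that merges allowed to cross spaces encode the same sentence in fewer tokens.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny illustrative corpus; repetition ensures frequent word pairs get merged.
corpus = [
    "the weather is nice today",
    "the weather is bad today",
    "is the weather nice today",
] * 50

def train_bpe(cross_whitespace: bool) -> Tokenizer:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    if cross_whitespace:
        # No pre-tokenizer: each line is one sequence, so BPE merges may cross
        # spaces and form multi-word ("superword") tokens.
        pass
    else:
        # Standard setup: split on whitespace first, so no merge spans two words.
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=200, special_tokens=["[UNK]"], show_progress=False
    )
    tok.train_from_iterator(corpus, trainer=trainer)
    return tok

sentence = "the weather is nice today"
for name, flag in [("whitespace-bounded BPE", False), ("cross-whitespace BPE", True)]:
    enc = train_bpe(flag).encode(sentence)
    print(f"{name}: {len(enc.tokens)} tokens -> {enc.tokens}")
```

On this toy corpus the whitespace-bounded tokenizer bottoms out at one token per word, while the unconstrained one learns multi-word tokens and needs fewer pieces for the same sentence.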
Generally, English is the most efficient (lowest token premium), while Southeast Asian, South Asian, and languages written without whitespace pay the highest premiums.
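The per-language token premium can be estimated with any off-the-shelf subword tokenizer by encoding parallel sentences and normalizing by the English token count. A minimal sketch, assuming the transformers library; the checkpoint and example sentences are illustrative, not from the source:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any subword tokenizer works

# Parallel sentences with roughly the same meaning in each language.
parallel = {
    "en": "The weather is nice today.",
    "de": "Das Wetter ist heute schön.",
    "th": "วันนี้อากาศดี",          # Thai: written without whitespace between words
    "hi": "आज मौसम अच्छा है।",      # Hindi: multi-byte UTF-8 characters
}

baseline = len(tokenizer.encode(parallel["en"]))
for lang, text in parallel.items():
    n_tokens = len(tokenizer.encode(text))
    # Token premium: how many more tokens a language needs relative to English.
    print(f"{lang}: {n_tokens} tokens, premium = {n_tokens / baseline:.2f}x")
```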

Recommendations