Montok
When tokenizing sentences with the same meaning across different languages, the number of tokens required varies significantly. Training 97 monolingual tokenizers under identical conditions showed that differences in whitespace usage between languages are a major contributing factor to this gap.
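This gap is usually expressed as a token premium: the token count of a sentence relative to its English counterpart. The study trains separate monolingual tokenizers; the sketch below is only an illustration of the measurement itself, reusing a single off-the-shelf multilingual tokenizer (xlm-roberta-base is an arbitrary choice) on hypothetical parallel sentences.

```python
# Illustration only: the study trains 97 monolingual tokenizers, whereas this sketch
# reuses one off-the-shelf multilingual tokenizer just to show how a token premium
# relative to English can be computed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Hypothetical parallel sentences with roughly the same meaning.
parallel = {
    "en": "The weather is nice today.",
    "de": "Das Wetter ist heute schön.",
    "th": "วันนี้อากาศดี",            # Thai: written without spaces between words
    "ja": "今日は天気がいいです。",    # Japanese: no whitespace word boundaries
}

counts = {lang: len(tokenizer.tokenize(text)) for lang, text in parallel.items()}
baseline = counts["en"]
for lang, n in counts.items():
    # Token premium: how many tokens a language needs relative to English.
    print(f"{lang}: {n} tokens, premium = {n / baseline:.2f}x")
```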
SuperBPE, which lifts the whitespace pre-tokenization constraint so that merges can cross word boundaries, improves overall compression and significantly reduces inequality between languages. However, differences stemming from language-specific characteristics, such as UTF-8 encoding, remain a challenge for future work.
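A minimal sketch of the whitespace effect, using the Hugging Face tokenizers library rather than the actual SuperBPE implementation (which trains in stages): the only difference between the two tokenizers below is whether pre-tokenization splits on whitespace, i.e. whether merges may produce tokens that span multiple words. The toy corpus and vocabulary size are placeholders.

```python
# Not the SuperBPE implementation, just the core idea: train two BPE tokenizers on the
# same toy corpus, one that pre-splits on whitespace (merges never cross word boundaries)
# and one without pre-tokenization (merges may span spaces, producing multi-word tokens).
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus = [
    "tokenization costs differ across languages",
    "the same sentence can need far more tokens in another language",
]  # placeholder corpus; real experiments use large monolingual text

def train_bpe(split_on_whitespace: bool) -> Tokenizer:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    if split_on_whitespace:
        tok.pre_tokenizer = pre_tokenizers.Whitespace()  # conventional word-bounded BPE
    trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

word_bounded = train_bpe(split_on_whitespace=True)
superword = train_bpe(split_on_whitespace=False)

sentence = "tokenization costs differ across languages"
print(len(word_bounded.encode(sentence).tokens))  # merges stop at word boundaries
print(len(superword.encode(sentence).tokens))     # whole phrases can become single tokens
```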
In general, English is the most efficient (lowest token premium), while Southeast Asian, South Asian, and languages written without whitespace pay the largest premiums.

Seonglae Cho