Byte Level Tokenizer

Creator

Seonglae Cho

Created

2025 May 30 22:4

Editor

Seonglae Cho

Edited

2025 May 30 22:7

Refs

Byte Latent Transformer

When using UTF-8 encoding, each character is represented by 1-4 bytes, and each byte value is treated as a single token ID. Out-of-Vocabulary (OOV) Problem Solving: Since everything can always be represented at the byte level, even when unregistered words or neologisms appear, OOV problems fundamentally do not occur. This helps reduce model size as it doesn't require a very large vocabulary dictionary.

Backlinks

Tokenizer Substitute Dia

Recommendations

////////