Byte Level Tokenizer

Created
2025 May 30 22:4
Editor
Creator
Seonglae Cho
Edited
2025 May 30 22:7
With UTF-8 encoding, each character is represented by 1 to 4 bytes, and a byte-level tokenizer treats each byte value (0 to 255) as a single token ID. This solves the Out-of-Vocabulary (OOV) problem: because any text can always be represented at the byte level, unregistered words and neologisms never fall outside the vocabulary. It also helps reduce model size, since the base vocabulary needs only 256 entries rather than a very large vocabulary dictionary.
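As a minimal sketch of the idea, the byte-to-ID mapping can be written in a few lines of Python. The names `encode`, `decode`, and `VOCAB_SIZE` are illustrative assumptions, not part of any particular library, and the sketch covers only the plain byte mapping (no special tokens or merges).

```python
# Illustrative byte-level tokenizer: one token ID per UTF-8 byte.
# All names here are hypothetical and chosen for this example only.

VOCAB_SIZE = 256  # every possible byte value, 0..255


def encode(text: str) -> list[int]:
    """Convert text to token IDs, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Convert token IDs back to text.

    Invalid byte sequences are replaced rather than raising an error.
    """
    return bytes(ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    ids = encode("héllo")
    print(ids)          # 'é' occupies two bytes, so 5 characters give 6 IDs
    print(decode(ids))  # round-trips back to the original text
```

Any string, including an unseen word or an emoji, encodes to IDs below `VOCAB_SIZE`, which is why no OOV token is needed in this scheme.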
