Text Tokenizer Training

Creator
Seonglae Cho
Created
2024 May 18 6:47
Edited
2024 Nov 18 15:17
Refs
Large language models suffer from under-trained tokens because the tokenizer is trained independently of the model, so some vocabulary entries rarely or never appear in the model's training data.
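A toy illustration of the cause (the vocabulary and corpus below are hypothetical): a tokenizer's vocabulary can contain tokens that never occur in the model's training text, so their embeddings are never updated.

```python
# Hypothetical vocabulary and training corpus for illustration only.
vocab = ["the", "cat", "sat", "SolidGoldMagikarp"]
corpus = "the cat sat on the mat the cat"

# Count how often each vocabulary token appears in the corpus.
counts = {tok: corpus.split().count(tok) for tok in vocab}

# Tokens with zero occurrences receive no gradient updates during training,
# leaving their embeddings near initialization ("under-trained").
under_trained = [tok for tok, c in counts.items() if c == 0]
print(under_trained)  # → ['SolidGoldMagikarp']
```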
Automatically Detecting Under-trained Tokens in Large Language Models
This work analyzes the tokenizer's vocabulary and its encoding/decoding behavior, and detects under-trained token candidates using indicators such as the similarity between the embedding matrix and the final (unembedding) layer.
