Text Tokenizer Training

Creator
Seonglae Cho
Created
2024 May 18 6:47
Edited
2024 Nov 18 15:17
Refs
Large language models suffer from under-trained tokens because the tokenizer is trained independently of the model, so some vocabulary entries appear rarely or never in the model's training data and their embeddings are barely updated.
Automatically Detecting Under-trained Tokens in Large Language Models
Analyzes the tokenizer's vocabulary and its encoding/decoding behavior, and detects under-trained token candidates using metrics such as the similarity between the embedding matrix and the final model layer.
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
The disconnect between tokenizer creation and model training in language models has been known to allow certain inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted behavior.
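A minimal sketch of the embedding-similarity idea: tokens that were never updated during training tend to keep near-identical embedding directions, so rows of the (un)embedding matrix that point toward the mean direction of known-unused tokens are candidates. The function name, threshold, and synthetic data below are illustrative assumptions, not the paper's exact indicator.

```python
import numpy as np

def undertrained_candidates(unembed, unused_ids, threshold=0.9):
    """Flag token ids whose unembedding rows lie close (in cosine
    similarity) to the mean direction of known-unused tokens."""
    # Normalize rows to unit length so dot products are cosine similarities.
    norms = np.linalg.norm(unembed, axis=1, keepdims=True)
    unit = unembed / np.clip(norms, 1e-8, None)
    # Reference direction: mean of tokens known to be absent from training.
    ref = unit[list(unused_ids)].mean(axis=0)
    ref /= np.linalg.norm(ref)
    sims = unit @ ref
    known = set(unused_ids)
    return [i for i, s in enumerate(sims) if s >= threshold and i not in known]

# Synthetic demo: 50 "trained" tokens with random embeddings, 5 known-unused
# tokens clustered around one direction, and one planted under-trained token.
rng = np.random.default_rng(0)
d = 32
trained = rng.normal(size=(50, d))
ref_vec = rng.normal(size=d)
unused = ref_vec + 0.01 * rng.normal(size=(5, d))    # ids 50..54
planted = ref_vec + 0.01 * rng.normal(size=(1, d))   # id 55
unembed = np.vstack([trained, unused, planted])
cands = undertrained_candidates(unembed, unused_ids=range(50, 55))
```

In a real model, `unembed` would be the final-layer (unembedding) weight matrix and `unused_ids` a set of tokens verified never to occur in the training corpus; the threshold would be calibrated rather than fixed.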