Large language models can end up with under-trained tokens because the tokenizer is trained independently of, and before, the model itself.
Automatically Detecting Under-trained Tokens in Large Language Models
The method analyzes the tokenizer's vocabulary and its encoding/decoding behavior, then flags under-trained token candidates using model-weight indicators such as the similarity between rows of the embedding matrix and the final model layer (the unembedding matrix). A simplified sketch of this weight-based idea follows below.
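The snippet below is a minimal illustration, not the paper's exact indicator: it loads a HuggingFace causal LM, takes the unembedding matrix, and flags tokens whose rows have unusually low norm and point in roughly the same direction as other low-norm rows. The model name, the 1% reference set, and the 0.9 cosine threshold are illustrative assumptions.

```python
# Sketch: flag under-trained token candidates from model weights alone,
# assuming a HuggingFace causal LM with an accessible unembedding matrix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; swap in any causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

unemb = model.get_output_embeddings().weight.detach()   # [vocab, d_model]

# Heuristic 1: rows with unusually low L2 norm have likely received
# few or no gradient updates during training.
norms = unemb.norm(dim=-1)

# Heuristic 2: under-trained rows tend to cluster around a common direction,
# so high cosine similarity to the mean of the lowest-norm rows is suspicious.
k = max(1, int(0.01 * unemb.shape[0]))                  # bottom 1% by norm (assumption)
ref = unemb[norms.topk(k, largest=False).indices].mean(dim=0)
cos = torch.nn.functional.cosine_similarity(
    unemb, ref.unsqueeze(0).expand_as(unemb), dim=-1
)

candidates = (cos > 0.9).nonzero(as_tuple=True)[0]      # threshold is illustrative
for idx in candidates[:20]:
    print(idx.item(), repr(tok.convert_ids_to_tokens(idx.item())))
```

Candidates surfaced this way would still need verification, e.g. by prompting the model with the suspect tokens and checking whether it can reproduce them.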
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
The disconnect between tokenizer creation and model training in language models has been known to allow certain inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted model behaviour.
https://arxiv.org/abs/2405.05417


Seonglae Cho