Large language models suffer from under-trained tokens because the tokenizer is trained independently of the model, so some vocabulary entries rarely or never appear in the model's training data and their embeddings are never meaningfully updated.
Automatically Detecting Under-trained Tokens in Large Language Models
The method analyzes the tokenizer's vocabulary and its encoding/decoding behavior, and flags under-trained token candidates using indicators such as the similarity between rows of the embedding matrix and the final model layer.
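A minimal sketch of this kind of detection, not the paper's exact metrics: it assumes a Hugging Face causal LM (the model name "gpt2", the norm and cosine thresholds, and the choice of indicators are all illustrative), and combines simple embedding-based proxies with an encode/decode round-trip check.

```python
# Sketch: flag under-trained token candidates from embedding indicators
# plus an encode/decode round-trip check. Thresholds are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM exposing input/output embeddings
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

emb = model.get_input_embeddings().weight.detach()   # (vocab, d) input embeddings
out = model.get_output_embeddings().weight.detach()  # (vocab, d) final (unembedding) layer

# Indicator 1: unusually small input-embedding norm; under-trained rows
# tend to stay near their initialization.
norms = emb.norm(dim=-1)
low_norm = norms < norms.median() * 0.5  # illustrative threshold

# Indicator 2: high cosine similarity of the unembedding row to the mean of
# the low-norm rows, a crude proxy for "never updated" directions.
ref = out[low_norm].mean(dim=0) if low_norm.any() else out.mean(dim=0)
cos = torch.nn.functional.cosine_similarity(out, ref.unsqueeze(0), dim=-1)

# Indicator 3: the token's decoded string never re-encodes to that token id,
# so it cannot be produced by normal tokenization of input text.
def never_reencodes(token_id: int) -> bool:
    text = tok.decode([token_id])
    return token_id not in tok.encode(text, add_special_tokens=False)

candidates = [
    i for i in range(min(len(tok), emb.shape[0]))
    if (low_norm[i] or cos[i] > 0.9) and never_reencodes(i)
]
print(f"{len(candidates)} candidate under-trained tokens")
for i in candidates[:20]:
    print(i, repr(tok.decode([i])))
```

Candidates flagged this way would still need verification, for example by prompting the model with the suspect token and checking whether it can repeat or describe it.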