Transformer Pretraining

Creator: Seonglae Cho
Created: 2024 Feb 24 16:22
Edited: 2024 Aug 24 16:41
https://arxiv.org/pdf/2402.08268.pdf
First, pretraining on more data shows no significant improvement in the model’s ability to acquire and retain factual knowledge. Second, forgetting of both memorization and generalization of factual knowledge follows a power law in training steps, and LLMs trained on duplicated data forget faster. Third, training with larger batch sizes makes models more robust to forgetting.
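As a rough illustration of the power-law claim (the functional form, symbols, and exponent below are assumptions for exposition, not taken from the paper), forgetting after the last exposure to a fact can be sketched as:

$$
R(t) \;\approx\; R_{0}\,\Big(\frac{t - t_{\text{exp}}}{\tau}\Big)^{-\alpha}, \qquad t > t_{\text{exp}} + \tau,
$$

where $R(t)$ is the retained probability gain on the fact at training step $t$, $t_{\text{exp}}$ is the step of the last exposure, $\tau$ is a time scale, and $\alpha > 0$ is the decay exponent. Under this reading, training on duplicated data corresponds to a larger effective $\alpha$ (faster forgetting), while larger batch sizes correspond to a smaller one.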
Factual knowledge acquisition during LLM pretraining happens by accumulating micro-acquisitions: each exposure to a fact nudges the model toward it. When the fact no longer appears in training, forgetting sets in and the accumulated acquisition is gradually diluted.
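A toy simulation of this accumulate-and-dilute dynamic (the bump size, decay form, and intervals are invented for illustration; this is not the paper's model):

```python
# Toy model of "micro-acquisitions": each exposure to a fact bumps the model's
# knowledge signal for that fact, and the gain decays between exposures.
import numpy as np

def simulate(total_steps, exposure_interval, bump=1.0, alpha=0.5):
    """Return the fact's accumulated knowledge signal over training steps."""
    signal = np.zeros(total_steps)
    level = 0.0
    steps_since_exposure = 0
    for t in range(total_steps):
        if t % exposure_interval == 0:       # fact appears in the batch
            level += bump                    # micro-acquisition
            steps_since_exposure = 0
        else:
            steps_since_exposure += 1
            # assumed power-law style dilution between exposures
            level *= (steps_since_exposure / (steps_since_exposure + 1)) ** alpha
        signal[t] = level
    return signal

# Frequent exposure (short interval) accumulates; rare exposure barely does.
print(simulate(1000, exposure_interval=20)[-1])   # "popular" fact
print(simulate(1000, exposure_interval=400)[-1])  # "unpopular" fact
```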

learnability threshold

LLMs struggle to acquire unpopular knowledge because they need sufficient exposures to a fact, spaced at intervals shorter than the learnability threshold, for the probability assigned to that fact to keep increasing. Deduplicating the pretraining corpus improves performance by preventing the model from assigning inflated probability to duplicated sequences and by helping it retain acquired generalization for longer.
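For the deduplication point, a minimal sketch of exact-match corpus deduplication (real pretraining pipelines typically also apply fuzzy methods such as MinHash over n-grams; this example is purely illustrative):

```python
# Exact-match deduplication: keep only the first occurrence of each document.
import hashlib

def dedup(documents):
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower is in Paris.",   # duplicate, would be dropped
    "Mount Everest is in Nepal.",
]
print(len(dedup(corpus)))  # 2
```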

Including code in the pretraining data significantly improved performance on non-code tasks.

