First, pretraining on more data shows no significant improvement in the model's ability to acquire and retain factual knowledge. Second, there is a power-law relationship between training steps and the forgetting of both memorized and generalized factual knowledge, and LLMs trained on duplicated data forget faster. Third, training with larger batch sizes improves the model's robustness to forgetting.
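One way to picture the power-law claim is a minimal sketch in which the retained fraction of an acquired probability gain decays as (elapsed steps)^(-alpha); the functional form and the alpha values below are hypothetical, chosen only to contrast deduplicated versus duplicated data, and are not the paper's fitted model.

```python
import math

def retained_fraction(elapsed_steps: float, alpha: float) -> float:
    """Fraction of an acquired gain retained after `elapsed_steps` further training steps."""
    return max(elapsed_steps, 1.0) ** (-alpha)

def forgetting_half_life(alpha: float) -> float:
    """Elapsed steps after which half of an acquired gain has been forgotten."""
    return 0.5 ** (-1.0 / alpha)

for label, alpha in [("deduplicated (hypothetical alpha = 0.05)", 0.05),
                     ("duplicated   (hypothetical alpha = 0.15)", 0.15)]:
    print(f"{label}: retained after 1k steps = {retained_fraction(1_000, alpha):.3f}, "
          f"half-life ≈ {forgetting_half_life(alpha):,.0f} steps")
```

A larger decay exponent shrinks the forgetting half-life dramatically, which is the qualitative pattern the duplicated-data finding describes.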
Factual knowledge acquisition in LLM pretraining is achieved by accumulating micro-acquisitions: each presentation of a fact slightly increases the probability the model assigns to it. Whenever the fact is not presented, forgetting sets in and the accumulated acquisition is gradually diluted.
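The accumulate-and-dilute dynamic can be illustrated with a toy simulation; the fixed gain per exposure and the constant per-step decay factor below are illustrative assumptions, not quantities from the paper.

```python
def simulate_gain(total_steps: int,
                  exposure_interval: int,
                  gain_per_exposure: float = 1.0,
                  decay_per_step: float = 0.995) -> float:
    """Accumulated (toy) probability gain for one fact after `total_steps` of training."""
    gain = 0.0
    for step in range(total_steps):
        gain *= decay_per_step                # forgetting acts on every step
        if step % exposure_interval == 0:     # the fact appears in this training batch
            gain += gain_per_exposure         # one micro-acquisition
    return gain

# Frequent exposures accumulate into a sizeable gain; sparse exposures are
# diluted back toward zero before the next presentation arrives.
print(simulate_gain(10_000, exposure_interval=50))     # ≈ 3.5
print(simulate_gain(10_000, exposure_interval=2_000))  # ≈ 4e-05
```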
LLMs struggle to acquire unpopular knowledge because net acquisition requires repeated exposures separated by intervals shorter than the learnability threshold, i.e. the longest gap between presentations at which acquisition still outpaces forgetting. Deduplicating the pretraining corpus improves LLM performance by preventing the model from assigning a higher probability to duplicated sequences and by helping it retain acquired generalization for longer.
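Under the same toy assumptions as the sketch above, a learnability threshold can be pictured as the longest exposure interval whose steady-state gain still reaches some target level; the closed form and the target value below are illustrative, not the paper's definition.

```python
# Exposures every T steps drive the toy gain toward the steady state
# gain / (1 - decay**T); requiring that this reaches `target` gives the
# largest workable interval:  T* = ln(1 - gain / target) / ln(decay)
import math

def learnability_threshold(gain: float, decay: float, target: float) -> float:
    """Largest exposure interval (in steps) whose steady-state gain still reaches `target`."""
    assert 0.0 < decay < 1.0 and 0.0 < gain < target
    return math.log(1.0 - gain / target) / math.log(decay)

# Hypothetical numbers: exposures must recur at least every ~81 steps for the
# accumulated gain to ever reach 3x a single exposure's contribution.
print(learnability_threshold(gain=1.0, decay=0.995, target=3.0))  # ≈ 80.9
```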