Artificial neural networks learn by extracting features from data, with each neuron abstracting and separating aspects of the input.
The idea that more compression leads to more intelligence has a strong philosophical grounding. Pretraining compresses data into generalized abstractions that connect different concepts through analogies, while reasoning is a specific skill that involves careful thinking to unlock various problem-solving capabilities.
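A minimal sketch of the compression view: a model that assigns higher probability to the data can encode it in fewer bits (cross-entropy is code length), so a lower pretraining loss literally means better compression. The toy corpus and probabilities below are made-up illustrations, not real model statistics.

```python
import math

# Toy "corpus" plus two models: a uniform baseline and a model that has
# learned the token statistics. All numbers are illustrative only.
corpus = ["the"] * 6 + ["cat", "sat", "on", "mat"]
vocab = {"the", "cat", "sat", "on", "mat"}

uniform = {tok: 1 / len(vocab) for tok in vocab}
learned = {"the": 0.6, "cat": 0.1, "sat": 0.1, "on": 0.1, "mat": 0.1}

def code_length_bits(model, tokens):
    """Total bits needed to encode the tokens under the model (cross-entropy * N)."""
    return sum(-math.log2(model[tok]) for tok in tokens)

print("uniform model:", round(code_length_bits(uniform, corpus), 1), "bits")
print("learned model:", round(code_length_bits(learned, corpus), 1), "bits")
# The better predictive model compresses the same data into fewer bits.
```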
- Data efficiency matters, and there is reason for optimism: algorithmic improvements stack well. Still, sample efficiency anywhere near human-level learning is far away.
- Semi-synchronous scaling might work with 10+ million GPUs in the future, since, as in the brain, not all parts necessarily need to communicate with each other.
- For scaling laws, the problem is that extending into the lower-probability tail requires roughly 10x more computation, since the relevant concepts appear only sparsely in the long tail.
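A back-of-the-envelope illustration of the long-tail point: if a concept appears with frequency p in the pretraining stream, you need on the order of k/p tokens to see it k times, so a concept 10x rarer needs roughly 10x more data (and compute) to be covered equally well. The frequencies and occurrence target below are made up.

```python
def tokens_needed(freq_per_token, occurrences_wanted=100):
    """Expected number of tokens to observe a concept `occurrences_wanted` times
    when it appears with frequency `freq_per_token` (illustrative)."""
    return occurrences_wanted / freq_per_token

for freq in [1e-6, 1e-7, 1e-8]:  # each step: a 10x rarer concept
    print(f"freq={freq:.0e} -> ~{tokens_needed(freq):.1e} tokens")
# Each 10x step down the tail multiplies the required data (and compute) by ~10x.
```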
Datasets for AI come in three types (example records sketched after this list):
- Background information - Pretraining
- Problems with solutions - SFT
- Practice problems - Reinforcement Learning
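For concreteness, here is a hypothetical sketch of what one record of each type might look like; the field names and examples are made up, not any particular dataset's schema.

```python
# Hypothetical single records for the three data types (made-up schema).
pretraining_record = {  # background information: raw text, no labels
    "text": "Gradient descent updates parameters in the direction that reduces the loss...",
}

sft_record = {  # problem with its worked solution
    "prompt": "Solve 2x + 3 = 11.",
    "response": "Subtract 3 from both sides: 2x = 8, so x = 4.",
}

rl_record = {  # practice problem: only a checker, no worked solution
    "prompt": "Solve 2x + 3 = 11.",
    "reward_fn": lambda answer: 1.0 if answer.strip() == "4" else 0.0,
}

print(rl_record["reward_fn"]("4"))  # 1.0 -- the model learns from the reward signal alone
```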
Pretraining Notion
How the training process and loss value relate to a neural network's ability
Perhaps the most striking phenomenon Anthropic has noticed is that the learning dynamics of toy models with large numbers of features appear to be dominated by "energy level jumps," where features jump between different feature dimensionalities.
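A minimal numpy sketch of the "feature dimensionality" quantity those jumps are described in terms of, D_i = ||W_i||^2 / sum_j (W_i_hat · W_j)^2, as defined in Anthropic's toy-models write-up; the random weight matrix here is just an illustration, not a trained model.

```python
import numpy as np

def feature_dimensionality(W):
    """W: (n_features, d_hidden). D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2,
    i.e. how much of a hidden dimension feature i gets to itself."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)      # ||W_i||
    W_hat = W / np.clip(norms, 1e-12, None)                # unit-normalized rows
    overlaps = (W_hat @ W.T) ** 2                          # (W_i_hat . W_j)^2
    return (norms[:, 0] ** 2) / overlaps.sum(axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))  # 8 features squeezed into 4 hidden dims (illustrative)
print(feature_dimensionality(W).round(2))
# Values near 1 mean a feature has a dimension to itself; fractional values
# (1/2, 1/3, ...) are the "energy levels" that features jump between during training.
```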
Procedural Knowledge in Pretraining
We observe that code data is highly influential for reasoning. StackExchange as a source has more than ten times as much influential data in the top and bottom portions of the rankings as would be expected if the influential data were randomly sampled from the pretraining distribution. Other code sources and ArXiv & Markdown are twice or more as influential as expected when drawing randomly from the pretraining distribution.