Influence Function

Creator
Seonglae Cho
Created
2025 Jun 17 9:44
Edited
2025 Jun 17 10:52
Refs

Estimate the change in a model's output logits caused by individual training data points

Influence functions can identify problematic documents within a continued-pretraining corpus, enabling more targeted curation of safer language-specific data. There is currently very limited work analyzing training-example-to-output relationships for multilingual safety-relevant behaviors.
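For reference, the classical influence-function formulation (a sketch of the standard definition; the symbols below are not from this note): the influence of a training example $z$ on a query $z_q$ is the first-order change in the query loss under an infinitesimal upweighting of $z$,

$$
\mathcal{I}(z, z_q) \;=\; -\,\nabla_\theta L(z_q, \hat{\theta})^{\top} \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} \;=\; \nabla_\theta^2 \, \frac{1}{n}\sum_{i=1}^{n} L(z_i, \hat{\theta}),
$$

where $H_{\hat{\theta}}$ is the training-loss Hessian at the trained parameters. Making $H^{-1}$ tractable at LLM scale is exactly what EK-FAC (below) addresses.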

EK-FAC

Since model parameters number in the billions, directly computing, storing, and inverting the full Hessian is infeasible. EK-FAC therefore approximates each layer's Hessian as the Kronecker product of two much smaller matrices: the covariance of the layer's input activations and the covariance of the gradients at its outputs. Additionally, instead of relying on raw gradient information alone, second-order information from the
Fisher Information Matrix
is used for a more accurate estimate.
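A minimal NumPy sketch of the idea (illustrative shapes and names, not any particular library's API): the per-layer Hessian/Fisher is approximated as a Kronecker product of an input-activation covariance and an output-gradient covariance, and EK-FAC's eigenvalue correction refits the diagonal in that Kronecker eigenbasis from per-example gradients.

```python
# Sketch of a Kronecker-factored (K-FAC style) layer Hessian with EK-FAC's
# eigenvalue correction. Toy sizes; a real layer y = W a is far larger.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 1024, 16, 8

A_acts = rng.normal(size=(n, d_in))    # layer input activations over a batch
G_grads = rng.normal(size=(n, d_out))  # gradients w.r.t. layer outputs

# K-FAC: approximate the layer Hessian/Fisher as H ≈ A ⊗ S,
# where A and S are small covariance matrices.
A = A_acts.T @ A_acts / n              # (d_in,  d_in)  input covariance
S = G_grads.T @ G_grads / n            # (d_out, d_out) output-gradient covariance

# The Kronecker product's eigenvectors are Kronecker products of the
# factors' eigenvectors; only the eigenvectors are kept.
_, QA = np.linalg.eigh(A)
_, QS = np.linalg.eigh(S)

# EK-FAC: refit the eigenvalues from per-example gradients projected
# into the Kronecker eigenbasis, instead of using the factor eigenvalues.
per_example_grads = np.einsum('ni,nj->nij', G_grads, A_acts)    # per-example ∇_W loss
proj = np.einsum('oi,nij,jk->nok', QS.T, per_example_grads, QA)  # rotate into eigenbasis
Lambda = (proj ** 2).mean(axis=0)      # corrected eigenvalues, shape (d_out, d_in)

def ihvp(v, damping=1e-3):
    """Damped inverse-Hessian-vector product for a gradient-shaped v."""
    v_rot = QS.T @ v @ QA              # rotate into the Kronecker eigenbasis
    v_scaled = v_rot / (Lambda + damping)
    return QS @ v_scaled @ QA.T        # rotate back

v = rng.normal(size=(d_out, d_in))
print(ihvp(v).shape)                   # (8, 16)
```

In an influence-function pipeline, `ihvp` would be applied to the query gradient before taking inner products with training-example gradients; the damping term stands in for the regularization needed in non-convex regions.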

Results

Rare tokens tend to receive high influence from a small number of training examples, while frequent tokens show more distributed influence. As model size increases, semantically similar sequences show high influence even with little token overlap, indicating stronger generalization and abstraction. Strong generalization was observed in high-resource languages, whereas low-resource languages showed limited generalization. Additionally, for mathematics and programming, exactly matching training examples remained important. When the word order of a sentence is reversed, its influence almost disappears, confirming that LLMs rely heavily on sequential information.

Limitation

In non-convex regions of the loss landscape, the accuracy of the Hessian approximation may degrade, and certain phenomena such as word-order sensitivity cannot be fully explained. Future work calls for more sophisticated curvature approximations and broader candidate-filtering techniques.
 
 

EK-FAC & Eigenvalue Correction with TF-IDF filtering
Nelson Elhage

Procedural Knowledge in Pretraining

In Cohere's Command R, procedural knowledge (see
Procedural memory
) showed strong correlations in document influence across similar types of math problems (e.g., gradient calculations). During answer completion, the influence of individual documents was smaller and more evenly distributed than in factual retrieval tasks, suggesting that the model learns "solution procedures" rather than retrieving specific facts. Unlike question-answering tasks, where answer texts frequently appeared among the top documents, they were rarely found in the reasoning dataset, supporting generalization over memorization.
In particular, math and code examples contributed significantly to reasoning in the pretraining data, with code documents identified as a major source for propagating procedural solutions. StackExchange contributes more than ten times as much influential data to the top and bottom of the rankings as would be expected if influential data were sampled randomly from the pretraining distribution, and other code sources as well as ArXiv & Markdown are at least twice as influential as a random draw (see the illustrative ratio computation below).
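A rough sketch of how such an overrepresentation ratio could be computed (hypothetical source names and numbers, not the paper's data): compare each source's share among the top-k most influential documents to its share of the pretraining distribution.

```python
# Illustrative overrepresentation of data sources among top-influence
# documents, relative to the pretraining distribution. Numbers are made up.
from collections import Counter

# Hypothetical source labels for the k most influential documents of one query.
top_k_sources = ["stackexchange"] * 30 + ["code"] * 25 + ["arxiv"] * 15 + ["web"] * 30

# Hypothetical share of each source in the full pretraining corpus.
pretraining_share = {"stackexchange": 0.03, "code": 0.12, "arxiv": 0.07, "web": 0.78}

counts = Counter(top_k_sources)
k = len(top_k_sources)
for source, share in pretraining_share.items():
    observed = counts[source] / k      # share among top-k influential documents
    ratio = observed / share           # > 1 means overrepresented
    print(f"{source:14s} observed={observed:.2f} expected={share:.2f} ratio={ratio:.1f}x")
```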
 
 
 

Backlinks

BIF
