Grokking

Created: 2024 Jan 14 11:57

Edited: 2025 Jun 2 11:40

Refs: Emergent ability

Deep learning presents a challenge to classical statistical learning theory. Neural networks often reach zero training error yet still generalize well to unseen data. This contradicts the classical expectation that a model which fits its training data perfectly has overfit, and it makes many classical generalization bounds uninformative.

Sparse activation and the superposition hypothesis have been proposed as possible explanations for the grokking phenomenon, where a model first overfits its training data and then, after much longer training, learns to activate sparsely and generalizes well.

[Figure: Modern interpolating regime, from Belkin et al. (2018)]
Grokking

The paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets", from OpenAI and Google researchers, examines how neural networks generalize on small, algorithmically generated datasets. A network can significantly improve its generalization performance long after the point of overfitting, in some cases reaching perfect generalization. The study is significant because it probes how overparameterized neural networks generalize rather than merely memorize a finite training set.
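To make "small, algorithmically generated dataset" concrete, here is a minimal sketch of how such a dataset might be built. It assumes modular addition as the task, a modulus of 97, and a 50% training split; these values and the function name `make_modular_addition_dataset` are illustrative choices, not details taken from this page.

```python
import random


def make_modular_addition_dataset(p=97, train_fraction=0.5, seed=0):
    """Enumerate every (a, b) -> (a + b) mod p example and split it randomly.

    p, train_fraction and seed are assumed values for illustration only.
    """
    # The full dataset is finite: exactly p * p input pairs.
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(examples)

    n_train = int(len(examples) * train_fraction)
    return examples[:n_train], examples[n_train:]


train, test = make_modular_addition_dataset()
print(len(train), len(test))
```

Because the whole input space can be enumerated, the held-out split measures true generalization to unseen pairs. In this kind of setup, training accuracy can saturate early while held-out accuracy rises only much later, which is the delayed generalization the paper describes.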

Neel Nanda's mechanistic interpretability analysis of grokking was published as a spotlight paper at ICLR 2023.

Emergent Abilities of Large Language Models

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon...

https://arxiv.org/abs/2206.07682

A Mechanistic Interpretability Analysis of Grokking — LessWrong

A significantly updated version of this work is now on Arxiv and was published as a spotlight paper at ICLR 2023 …

https://www.lesswrong.com/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can…

https://ar5iv.labs.arxiv.org/html/2201.02177

Acceleration method

Grokfast: Accelerated Grokking by Amplifying Slow Gradients

One puzzling artifact in machine learning dubbed grokking is where delayed generalization is achieved tenfolds of iterations after near perfect overfitting to the training data. Focusing on the...

https://arxiv.org/abs/2405.20233
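The idea of "amplifying slow gradients" can be illustrated with a small sketch. It assumes the slow component of the gradient is estimated with an exponential moving average (a simple low-pass filter) and then added back to the raw gradient with a gain. The function name `amplify_slow_gradients`, the parameters `alpha` and `lamb`, and their default values are assumptions made for illustration and may differ from the paper's exact procedure.

```python
def amplify_slow_gradients(grads, ema, alpha=0.98, lamb=2.0):
    """Return (filtered_grads, new_ema) for one optimization step.

    grads: list of gradient values for the current step.
    ema:   running average of past gradients (same length as grads).
    alpha: assumed smoothing factor of the moving average.
    lamb:  assumed gain applied to the slow component.
    """
    # Low-pass filter: fast oscillations mostly cancel, slow trends remain.
    new_ema = [alpha * m + (1.0 - alpha) * g for m, g in zip(ema, grads)]
    # Add the amplified slow component back onto the raw gradient.
    filtered = [g + lamb * m for g, m in zip(grads, new_ema)]
    return filtered, new_ema


# Toy gradient: a constant slow part (1.0) plus a fast part that flips sign.
ema = [0.0]
for step in range(500):
    raw = [1.0 + (-1.0) ** step]
    filtered, ema = amplify_slow_gradients(raw, ema)

print(ema[0])  # settles close to the slow part of the signal
```

In a real training loop, the filtered gradients would be handed to the optimizer in place of the raw ones. The toy loop only shows that the running average tracks the slow part of the signal while the alternating part is largely suppressed.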
