Grokking

Creator: Seonglae Cho
Created: 2024 Jan 14 11:57
Edited: 2025 Nov 26 18:32
Deep Learning presents a challenge to classical statistical learning theory: neural networks often achieve zero training error, yet they still generalize well to unseen data. This behavior contradicts traditional expectations and renders many classical generalization bounds vacuous.
Sparse activation and the Superposition Hypothesis have been proposed as possible explanations for the Grokking phenomenon, in which models trained on small algorithmic datasets initially overfit and only after extended training learn sparse activations and generalize well.
Modern interpolating regime by Belkin et al. (2018)

Grokking
The original Grokking paper, from OpenAI and Google researchers, examines how neural networks generalize on small, algorithmically generated datasets. A network's generalization performance improves dramatically well past the point of overfitting, in some cases reaching perfect generalization. The study is significant because it probes how overparameterized networks generalize beyond merely memorizing a finite training set.

Neel Nanda et al. (ICLR 2023) studied grokking with Mechanistic interpretability.

To find "progress measures," a transformer was trained on modular addition, a task where grokking is observed. Reverse-engineering the learned algorithm revealed that the model maps inputs to rotations on a circle and implements addition as composition of rotations, effectively using discrete Fourier transforms and trigonometric identities to perform the addition.
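The rotation-based algorithm can be sketched directly in NumPy. This is an illustration of the trigonometric-identity mechanism, not the trained model's weights; the modulus `p` and the single frequency `k` are assumptions for the example (the real model spreads the computation over several key frequencies).

```python
import numpy as np

p = 113  # modulus of the task (value used in the paper's setup)
k = 5    # one illustrative "key" frequency; a trained model uses several
w = 2 * np.pi * k / p

def add_mod_p(a, b):
    # Represent a and b as rotations by w*a and w*b on a circle and
    # compose them via the angle-addition (trig) identities.
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)  # cos(w(a+b))
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)  # sin(w(a+b))
    # Score each candidate answer c by cos(w(a+b-c)), which is maximal
    # (= 1) exactly when c == (a + b) mod p.
    c = np.arange(p)
    logits = cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))

add_mod_p(100, 50)  # → 37, since (100 + 50) % 113 == 37
```

Because the angles wrap around the circle, the mod-p reduction comes for free from the periodicity of cos and sin.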
  • Restricted loss: loss when everything except the key frequencies is ablated
  • Excluded loss: loss when only the key frequencies are ablated
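The two ablations can be illustrated on a single logit vector. This is a hypothetical sketch: the modulus and the set of "key" frequencies are assumptions for the example (in practice the key set is read off the trained model), and the projection is done with an explicit Fourier basis for clarity.

```python
import numpy as np

p = 113                       # modulus of the task
key_freqs = [14, 35, 41, 42]  # hypothetical key frequencies; model-specific in practice

def fourier_components(logits, freqs):
    """Sum of the cos/sin components of `logits` at the given frequencies."""
    c = np.arange(p)
    comp = np.zeros(p)
    for k in freqs:
        for basis in (np.cos(2 * np.pi * k * c / p), np.sin(2 * np.pi * k * c / p)):
            # project logits onto this basis vector and accumulate
            comp += (logits @ basis) / (basis @ basis) * basis
    return comp

def restricted(logits):
    # keep only the key frequencies (plus the constant term)
    return logits.mean() + fourier_components(logits, key_freqs)

def excluded(logits):
    # remove only the key frequencies, keep everything else
    return logits - fourier_components(logits, key_freqs)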
Generalization proceeds in three phases:
  • Memorization (memorizing training data)
  • Circuit formation (forming generalizable algorithms internally)
  • Cleanup (removing memorization mechanisms)

Acceleration method
