Deep double descent

Creator: Seonglae Cho
Created: 2024 Sep 14 20:18
Edited: 2026 Feb 22 20:10

Double descent of generalization performance

In a broad definition, Intelligence can be characterized by whether Deep double descent occurs or not: the model goes beyond memorization to achieve generalization.
  1. First descent: test error falls as capacity grows
  2. Overfitting: test error peaks near the interpolation threshold
  3. Second descent: test error falls again in the overparameterized regime
Deep Learning
presents a challenge to classical statistical learning theory. Neural networks often achieve zero training error, yet they generalize well to unseen data. This contradicts traditional expectations and makes many classical generalization bounds ineffective.
Sparse activation and the Superposition Hypothesis have been proposed as possible explanations for the Grokking phenomenon, in which a model first overfits by memorizing its training data and only much later learns sparsely activating, generalizing representations.
Modern interpolating regime by Belkin et al. (2018)
https://openai.com/index/deep-double-descent/
Deep double descent
We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.
Deep Double Descent: Where Bigger Models and More Data Hurt
We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show...