Double descent of generalization performance
- First descent: the classical bias-variance regime, where test error first falls and then rises as model capacity approaches the interpolation threshold
- Second descent: beyond the interpolation threshold, test error falls again as capacity keeps growing
Deep learning challenges classical statistical learning theory: neural networks often achieve zero training error (interpolating the training set) yet still generalize well to unseen data. This contradicts the classical bias-variance tradeoff and leaves many classical generalization bounds vacuous for such models.
Sparse activation and the Superposition Hypothesis have been proposed as possible explanations for the Grokking phenomenon, in which models trained on small algorithmic datasets first memorize the training set and then, long after overfitting, suddenly learn sparse representations that generalize.
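
A rough sketch of the standard grokking setup the claim above refers to (modular addition, as in Power et al.), assuming PyTorch. The architecture and hyperparameters below (MLP width, training fraction, weight decay, step count) are illustrative choices, not from any of the linked sources, and whether the delayed jump in test accuracy actually appears is sensitive to them:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Standard grokking task: (a, b) -> (a + b) mod p,
# a small algorithmic dataset trained far past the point of memorization
p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

perm = torch.randperm(len(pairs))
split = len(pairs) // 2          # training fraction is a key knob for grokking
train_idx, test_idx = perm[:split], perm[split:]

def encode(idx):
    # One-hot encode both operands and concatenate
    a = nn.functional.one_hot(pairs[idx, 0], p).float()
    b = nn.functional.one_hot(pairs[idx, 1], p).float()
    return torch.cat([a, b], dim=1)

X_train, y_train = encode(train_idx), labels[train_idx]
X_test, y_test = encode(test_idx), labels[test_idx]

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
# Strong weight decay is the regularizer usually credited with inducing grokking
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_001):
    opt.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            train_acc = (model(X_train).argmax(1) == y_train).float().mean()
            test_acc = (model(X_test).argmax(1) == y_test).float().mean()
        # Expected pattern if grokking occurs: train accuracy saturates early,
        # test accuracy stays near chance for a long time, then jumps
        print(f"step {step:6d}  train {train_acc:.2f}  test {test_acc:.2f}")
```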

Deep double descent
We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.
https://openai.com/index/deep-double-descent/
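
A minimal sketch of model-wise double descent in a random-features regression, where the number of random features stands in for model size; this toy setting is not from either linked work, and every name and hyperparameter below is illustrative. With the minimum-norm (pseudoinverse) solution, test error typically rises sharply near the interpolation threshold (features ≈ training samples) and falls again beyond it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: linear teacher plus noise
n_train, n_test, d = 100, 1000, 50
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

def relu_features(X, W):
    # Random ReLU features: phi(x) = max(0, x W); columns of W set "model size"
    return np.maximum(X @ W, 0.0)

for n_features in [10, 50, 90, 100, 110, 150, 300, 1000, 4000]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    Phi_train = relu_features(X_train, W)
    Phi_test = relu_features(X_test, W)
    # Minimum-norm least squares: interpolates once n_features >= n_train
    beta = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"features={n_features:5d}  test MSE={test_mse:9.3f}")
```

The error peak near features ≈ n_train is the critical regime where the model barely interpolates the noisy training labels; past it, the minimum-norm solution becomes smoother and test error descends a second time.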

Deep Double Descent: Where Bigger Models and More Data Hurt
We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show...
https://arxiv.org/abs/1912.02292


Seonglae Cho