Predicts the training set much better than the test set; high variance (fits spurious correlations)
Learning to memorize individual samples rather than generalizing from the data
Bigger models with more capacity are more likely to overfit. In other words, a large number of parameters allows the model to fit the training data very closely, which increases the model's variance. However, if the data is general enough, fitting the training set this tightly is not necessarily harmful, as the Deep double descent phenomenon shows. A toy illustration follows below.
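A minimal sketch (not from the source; the dataset, noise level, and polynomial degrees are arbitrary choices) of the capacity/variance trade-off: as the degree grows, training error falls toward zero while test error blows up.

```python
# Toy capacity-vs-variance demo: polynomial regression on a small noisy dataset.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 15))
y_train = np.sin(3 * x_train) + 0.2 * rng.normal(size=15)   # noisy samples
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = np.sin(3 * x_test)                                  # clean targets

for degree in (2, 5, 14):
    coeffs = np.polyfit(x_train, y_train, degree)            # fit the training set
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
# The degree-14 fit nearly interpolates the 15 training points (train MSE ~ 0)
# but its test error grows: the classical picture before double descent,
# which only appears with far larger models and datasets.
```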
Overfitting is related to information stored in the weights that encodes the particular training set, as opposed to the data-generating distribution. This corresponds to the distribution of weight vectors output by the training algorithm becoming more concentrated around solutions tied to that specific training set, so the weights carry more bits about it.

Modern approach
Deep Learning presents a challenge to classical statistical learning theory. Neural networks often achieve zero training error, yet they generalize well to unseen data. This contradicts traditional expectations and makes many classical generalization bounds ineffective.
Sparse activation and the Superposition Hypothesis have been proposed as possible explanations for the Grokking phenomenon, in which a model first overfits (memorizes the training set) and only after much further training suddenly generalizes, with its activations becoming sparser; this was originally observed on small algorithmic datasets.


Resolving Overfitting
Local KL Volume
This methodology defines a set of KL-neighbors (a behaviorally similar region) around the trained model weights and efficiently estimates the probability mass this region occupies under the initialization distribution (the Local KL Volume) using Monte Carlo estimation with importance sampling. The Local KL Volume measures, from the perspective of the initialization distribution, the "size" of the parameter region in which the output distribution stays nearly unchanged (KL divergence ≤ ε).
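A minimal sketch of the basic idea, assuming a PyTorch classifier and a non-shuffling data loader. It uses plain Monte Carlo acceptance of Gaussian perturbations rather than the full importance-sampling estimator (and omits the optimizer-preconditioning described below); the names `local_kl_volume`, `sigma`, and `eps` are mine.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def local_kl_volume(model, data_loader, eps=0.05, n_samples=200, sigma=0.02):
    """Fraction of init-scale random perturbations that keep the model's
    outputs within KL <= eps of the trained model (a crude volume proxy)."""
    model.eval()
    base = {n: p.detach().clone() for n, p in model.named_parameters()}
    # Cache reference log-probabilities of the unperturbed model.
    # data_loader must not shuffle, so batches line up across passes.
    ref = [F.log_softmax(model(x), dim=-1) for x, _ in data_loader]

    accepted = 0
    for _ in range(n_samples):
        # Perturb every parameter; sigma stands in for the init-distribution scale.
        for n, p in model.named_parameters():
            p.copy_(base[n] + sigma * torch.randn_like(p))
        kl = 0.0
        for (x, _), ref_logp in zip(data_loader, ref):
            logp = F.log_softmax(model(x), dim=-1)
            # KL(reference || perturbed), averaged over the batch.
            kl += F.kl_div(logp, ref_logp, log_target=True,
                           reduction="batchmean").item()
        if kl / len(ref) <= eps:
            accepted += 1

    for n, p in model.named_parameters():    # restore the trained weights
        p.copy_(base[n])
    return accepted / n_samples              # Monte Carlo volume estimate
```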
The KL volume is data-dependent, and the ratio of local KL volumes computed on the validation and training sets can be used to diagnose overfitting: a valid/train ratio below 1 indicates overfitting, a ratio close to 1 indicates a good fit, and a ratio above 1 suggests underfitting. Using second-moment information from optimizers such as the Adam Optimizer reduces directional variance, which significantly lowers the variance of the volume estimate. The negative log of the local volume can be interpreted as the network's information content (from an MDL perspective) and linked to generalization performance: as training progresses toward overfitting, the local volume decreases (complexity increases).
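A hypothetical usage of the sketch above (assuming `model`, `train_loader`, and `valid_loader` already exist and do not shuffle), showing how the ratio diagnostic and the MDL-style reading from this paragraph would be computed.

```python
import math

# Compare local volumes on the two datasets.
vol_train = local_kl_volume(model, train_loader)
vol_valid = local_kl_volume(model, valid_loader)

ratio = vol_valid / vol_train   # < 1: overfitting, ~1: good fit, > 1: underfitting
bits = -math.log2(vol_valid)    # MDL reading: description length of the network in bits
print(f"valid/train volume ratio = {ratio:.3f}, information content ≈ {bits:.1f} bits")
```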