Weight Decay

Creator: Seonglae Cho
Created: 2023 May 11 6:13
Edited: 2024 Mar 11 5:15

Imposing a penalty on the magnitude of the weights themselves

Minimizing the regularized loss is equivalent to first shrinking (decaying) the weights by a scalar factor of 1 - \mu\lambda and then applying the standard gradient step; this decay of the weights is what prevents Overfitting.
Cost = \text{Ordinary Loss} + \text{Regularization Term}
Training optimizes both terms at the same time, so there is a tradeoff between fitting the data and keeping the weights small.
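A minimal NumPy sketch of the equivalence claimed above, assuming a linear model with squared-error loss and the L2 penalty \frac{\lambda}{2}\|w\|_2^2 made explicit in the next section (the data and hyperparameters here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # toy inputs
y = rng.normal(size=32)        # toy targets
w = rng.normal(size=4)         # current weights
mu, lam = 0.1, 0.01            # learning rate and weight-decay coefficient

# Gradient of the ordinary (unregularized) mean squared error loss
grad_loss = 2 * X.T @ (X @ w - y) / len(y)

# (1) One gradient step on the regularized loss  L + (lambda/2) * ||w||^2
w_reg = w - mu * (grad_loss + lam * w)

# (2) Decay the weights by (1 - mu * lambda), then take the ordinary gradient step
w_decay = (1 - mu * lam) * w - mu * grad_loss

print(np.allclose(w_reg, w_decay))  # True: the two updates coincide
```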

When using the L2 Norm:

L_{reg} = \frac{\lambda}{2}\|w\|_2^2
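Taking one gradient-descent step with learning rate \mu on the regularized cost makes the shrink factor from above explicit (a short derivation using the L2 penalty just defined):

\nabla_w L_{reg} = \lambda w

w \leftarrow w - \mu\left(\nabla_w L + \lambda w\right) = (1 - \mu\lambda)\,w - \mu\nabla_w L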

Use case

Layer Normalization and Bias parameters are not sensitive to their magnitude, so they are put into a separate parameter group and excluded from weight decay, as sketched below.
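A minimal PyTorch sketch of this parameter-group split, assuming a toy model and the AdamW optimizer (the model, learning rate, and decay value are placeholders; splitting by tensor dimensionality is a common convention, not the only one):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.Linear(32, 4))

decay, no_decay = [], []
for name, param in model.named_parameters():
    # 1-D tensors cover biases and LayerNorm scale/shift parameters
    (no_decay if param.ndim <= 1 else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},    # penalize weight matrices
        {"params": no_decay, "weight_decay": 0.0},  # leave bias / LayerNorm alone
    ],
    lr=1e-3,
)
```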

Recommendations