Imposing a penalty on the size of the weights themselves
A gradient step on the regularized loss is equivalent to shrinking/decaying θ by a scalar factor (1 − ηλ) and then applying the standard gradient step; that decay coefficient is the weight decay, which helps prevent overfitting.
It is a trade-off to train both terms at the same time: fitting the data loss while keeping the weights small.
This equivalence holds when the penalty is the L2 norm (with plain SGD).
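A quick sketch of the equivalence, assuming plain SGD with learning rate η and L2 coefficient λ:

```latex
% L2-regularized loss
L_{\text{reg}}(\theta) = L(\theta) + \tfrac{\lambda}{2}\,\lVert\theta\rVert_2^2

% One SGD step on the regularized loss = decay, then standard step
\theta \leftarrow \theta - \eta\,\nabla L_{\text{reg}}(\theta)
        = \theta - \eta\bigl(\nabla L(\theta) + \lambda\theta\bigr)
        = (1 - \eta\lambda)\,\theta - \eta\,\nabla L(\theta)
```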
Use case
Layer Normalization and bias parameters don't benefit from shrinking their magnitude, so put them in a separate parameter group and exclude them from weight decay.
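A minimal PyTorch sketch of such parameter groups, assuming AdamW; the function name and hyperparameter values are placeholders:

```python
import torch
import torch.nn as nn

def split_weight_decay_params(model: nn.Module, weight_decay: float = 0.01):
    """Build optimizer parameter groups: decay the weight matrices, skip biases/norms."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Biases and LayerNorm scales are 1-D; their magnitude isn't something
        # we want to penalize, so they get weight_decay = 0.
        if param.ndim == 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# usage (model is whatever nn.Module you train):
# optimizer = torch.optim.AdamW(split_weight_decay_params(model), lr=3e-4)
```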