Pre-Norm

Creator: Seonglae Cho
Created: 2024 Nov 25 15:31
Edited: 2024 Nov 25 15:37
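An improvement over Post-Norm: layer normalization is applied to the input of each sublayer, inside the residual branch, rather than after the residual addition.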
$$x_{l+1} = x_l + f(\text{LN}(x_l))$$
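A minimal sketch of the update rule above, written in plain Python so it runs without dependencies. The `sublayer` function is a hypothetical stand-in for f (attention or a feed-forward network in a real Transformer), and `layer_norm` omits the learned scale and shift for brevity. The Post-Norm step, x_{l+1} = LN(x_l + f(x_l)), is included only for contrast.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for f; a real model would use attention or an MLP."""
    return [0.5 * math.tanh(v) for v in x]

def pre_norm_step(x):
    """Pre-Norm: x_{l+1} = x_l + f(LN(x_l)). Only the branch input is normalized."""
    branch = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, branch)]

def post_norm_step(x):
    """Post-Norm: x_{l+1} = LN(x_l + f(x_l)). The residual sum itself is normalized."""
    branch = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, branch)])

x = [1.0, -2.0, 3.0, 0.5]
pre = pre_norm_step(x)    # x passes through the skip path unchanged
post = post_norm_step(x)  # x is rescaled by the normalization
```

In the Pre-Norm step the input reaches the output through an untouched identity path, whereas in the Post-Norm step every layer rescales it.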
Training is more stable because the gradient has a simpler form than in Post-Norm, with no additional multiplicative factors:
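For contrast, the Post-Norm gradient can be sketched by applying the chain rule layer by layer. The intermediate symbol y_k is notation introduced here for illustration and does not appear in the original note.

```latex
% Post-Norm update, with y_k denoting the pre-normalization residual sum
x_{k+1} = \text{LN}(y_k), \qquad y_k = x_k + f(x_k)

% Chain rule from layer l up to the top layer L
\frac{\partial \ell}{\partial x_l}
  = \frac{\partial \ell}{\partial x_L}
    \prod_{k=l}^{L-1} \frac{\partial\, \text{LN}(y_k)}{\partial y_k}
    \left( 1 + \frac{\partial f(x_k)}{\partial x_k} \right)
```

Each layer contributes a multiplicative factor here, so the gradient reaching layer l depends on a product over depth.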
$$\frac{\partial \ell}{\partial x_l} = \frac{\partial \ell}{\partial x_L} \left( 1 + \sum_{k=l}^{L-1} \frac{\partial f(\text{LN}(x_k))}{\partial x_l} \right)$$
The constant term 1 carries the gradient from the top layer x_L directly to x_l, regardless of depth, and this stable gradient path enables larger learning rates.
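Unrolling the recurrence x_{k+1} = x_k + f(LN(x_k)) from layer l to L gives x_L = x_l + Σ f(LN(x_k)), so the Jacobian of x_L with respect to x_l is the identity plus the Jacobian of the summed branches. The sketch below checks this numerically with central differences. The sublayers built by `make_sublayer` are hypothetical stand-ins for f, and the depth and scales are arbitrary illustrative choices.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def make_sublayer(scale):
    """Hypothetical sublayer f_k; a real model would use attention or an MLP."""
    return lambda x: [scale * math.tanh(v) for v in x]

def run_stack(x, sublayers):
    """Apply x_{k+1} = x_k + f_k(LN(x_k)); return (x_L, sum of branch outputs)."""
    total = [0.0] * len(x)
    for f in sublayers:
        branch = f(layer_norm(x))
        x = [a + b for a, b in zip(x, branch)]
        total = [t + b for t, b in zip(total, branch)]
    return x, total

def jacobian(fn, x, h=1e-6):
    """Central-difference Jacobian with J[i][j] = d fn(x)[i] / d x[j]."""
    n = len(x)
    cols = []
    for j in range(n):
        up = list(x)
        up[j] += h
        dn = list(x)
        dn[j] -= h
        cols.append([(a - b) / (2 * h) for a, b in zip(fn(up), fn(dn))])
    return [[cols[j][i] for j in range(n)] for i in range(n)]

sublayers = [make_sublayer(s) for s in (0.3, 0.5, 0.2, 0.4)]
x_l = [1.0, -2.0, 3.0, 0.5]

x_L, branch_sum = run_stack(x_l, sublayers)
J_out = jacobian(lambda v: run_stack(v, sublayers)[0], x_l)  # d x_L / d x_l
J_sum = jacobian(lambda v: run_stack(v, sublayers)[1], x_l)  # d (sum of branches) / d x_l
```

Up to finite-difference error, `J_out` equals the identity matrix plus `J_sum`, which is the "1 + sum" structure of the Pre-Norm gradient.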
