Pre-Norm is a more recent improvement over Post-Norm: the LayerNorm is applied before each sublayer instead of after the residual addition:

$$x_{l+1} = x_l + f(\text{LN}(x_l))$$

Training is more stable because the gradient term is simpler than in Post-Norm: there is no extra chain of multiplications through the normalization layers. Unrolling the residual stream gives $x_L = x_l + \sum_{k=l}^{L-1} f(\text{LN}(x_k))$, so

$$\frac{\partial \ell}{\partial x_l} = \frac{\partial \ell}{\partial x_L}\left(1 + \sum_{k=l}^{L-1} \frac{\partial f(\text{LN}(x_k))}{\partial x_l}\right)$$

The identity term keeps gradient magnitudes roughly constant across depth, and these near-constant gradients enable larger learning rates.
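To make the recurrence concrete, here is a minimal sketch of a Pre-Norm block, assuming PyTorch; the class name `PreNormBlock` and the hyperparameters are illustrative choices, not from the original text. The key point is that LayerNorm sits inside each residual branch, so the skip path $x_l \to x_{l+1}$ stays an identity mapping.

```python
# Minimal Pre-Norm block sketch (assumed PyTorch; names and sizes are illustrative).
# Each sublayer follows x_{l+1} = x_l + f(LN(x_l)): normalize first, then add the residual.
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sublayer: x = x + attn(LN(x))
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Feed-forward sublayer: x = x + ff(LN(x))
        x = x + self.ff(self.ln2(x))
        return x

if __name__ == "__main__":
    block = PreNormBlock(d_model=64)
    x = torch.randn(2, 10, 64)   # (batch, seq, d_model)
    print(block(x).shape)        # torch.Size([2, 10, 64])
```

Because the skip connection is never normalized, the gradient always has a direct path back to every earlier layer, which is the mechanism behind the additive gradient decomposition above.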