Layer Normalization

Creator: Seonglae Cho
Created: 2019 Nov 19 7:51
Edited: 2025 Mar 18 0:08

A per-token transformation with learnable parameters (so the output is not forced into a perfectly normalized distribution)

BN normalizes activations across the batch, while LN normalizes activations within each layer, i.e. over the feature dimension of each token. It is usually combined with a Residual Connection as below.

$$LN = LayerNorm(x + SubLayer(x))$$

Normalizing each layer's output stabilizes training and reduces the difference in learning speed between layers.
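A minimal PyTorch sketch of the post-norm residual pattern above; the feed-forward sublayer and the dimension d_model=512 are illustrative assumptions, standing in for the attention or MLP sub-block of a real Transformer layer.

```python
import torch
import torch.nn as nn

class ResidualLayerNorm(nn.Module):
    """Post-norm residual block: LayerNorm(x + SubLayer(x))."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Hypothetical sublayer; in a Transformer this would be attention or an MLP
        self.sublayer = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # LayerNorm normalizes over the last (feature) dimension of each token
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))

x = torch.randn(2, 16, 512)          # (batch, tokens, features)
print(ResidualLayerNorm()(x).shape)  # torch.Size([2, 16, 512])
```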
Compared to Batch Normalization, only the dimension being normalized changes. Since the introduction of the Transformer Model the architecture has changed relatively little, but one of the biggest changes is that the Layer Normalization in the Attention Mechanism block moved from Post-Norm to Pre-Norm (see the sketch below).
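A minimal sketch contrasting the two placements, assuming a generic linear sublayer as a stand-in for attention or feed-forward: Post-Norm normalizes after the residual addition, while Pre-Norm normalizes the sublayer input and leaves the residual path untouched.

```python
import torch
import torch.nn as nn

d_model = 512
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention / feed-forward
norm = nn.LayerNorm(d_model)

def post_norm_block(x: torch.Tensor) -> torch.Tensor:
    # Original Transformer placement: normalize after the residual addition
    return norm(x + sublayer(x))

def pre_norm_block(x: torch.Tensor) -> torch.Tensor:
    # Modern placement: normalize the input; the residual path stays unnormalized
    return x + sublayer(norm(x))

x = torch.randn(2, 16, d_model)
print(post_norm_block(x).shape, pre_norm_block(x).shape)  # both (2, 16, 512)
```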
 

With linear (affine) transformation

This compensates for the possibility that the normalization step limits the model's expressive power.
  • γ - scale
  • β - shift
  • eps - a small epsilon added to the denominator during normalization to prevent division by zero
  • element-wise affine - learnable scaling and shifting applied to each element

$$LayerNorm(x) = \frac{x - \mu}{\sqrt{\sigma^2 + eps}} \times \gamma + \beta$$

where μ and σ² are the mean and variance computed over the feature dimension of each token.
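A minimal sketch of the formula above implemented directly in PyTorch and checked against nn.LayerNorm; the class name ManualLayerNorm and the eps default of 1e-5 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ManualLayerNorm(nn.Module):
    """LayerNorm(x) = (x - mu) / sqrt(var + eps) * gamma + beta."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))   # scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Statistics are computed per token, over the last (feature) dimension
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (x - mu) / torch.sqrt(var + self.eps) * self.gamma + self.beta

x = torch.randn(2, 16, 512)
manual = ManualLayerNorm(512)
reference = nn.LayerNorm(512, eps=1e-5)
print(torch.allclose(manual(x), reference(x), atol=1e-5))  # True
```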
Layer Normalizations