Layer Normalization

Creator

Created

2019 Nov 19 7:51

Editor

Edited

2025 Jul 24 10:46

Refs

Batch Normalization

Gradient Normalization

A per-token transformation with learnable parameters (so the output is not exactly a zero-mean, unit-variance distribution)

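Batch Normalization (BN) normalizes each feature across the samples in a batch, while Layer Normalization (LN) normalizes across the features of each individual sample (token) within a layer. It is usually combined with a Residual Connection like below.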

LN = LayerNorm(x + SubLayer(x))
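
The residual pattern above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `sublayer` is a hypothetical stand-in for an attention or feed-forward block, the affine parameters are omitted, and `eps=1e-5` is an assumed default.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to (approximately) zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward block."""
    return [0.5 * v + 1.0 for v in x]


def residual_layer_norm(x):
    """Compute LayerNorm(x + SubLayer(x)) for a single token vector."""
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)


out = residual_layer_norm([1.0, 2.0, 3.0, 4.0])
```

The residual sum is computed first and the normalization is applied to the result, so the output of every block has a comparable scale regardless of what the sublayer produced.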

Normalizing each layer's output stabilizes training and reduces the differences in learning speed between layers.
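
The difference between BN and LN comes down to which axis the statistics are taken over. The sketch below makes that concrete on a tiny batch of shape (samples, features). It assumes training-time batch statistics with no running averages and no affine parameters, and the helper names are illustrative only.

```python
import math


def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm_rows(batch):
    """LN: statistics over the features of each sample (one row at a time)."""
    return [normalize(row) for row in batch]


def batch_norm_columns(batch):
    """BN: statistics over the samples for each feature (one column at a time)."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


# Two samples, three features each.
batch = [[1.0, 2.0, 3.0],
         [4.0, 8.0, 12.0]]

ln_out = layer_norm_rows(batch)     # each row sums to ~0
bn_out = batch_norm_columns(batch)  # each column sums to ~0
```

Because LN never looks across samples, its result for one sample does not depend on the batch size or on the other samples in the batch.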

Compared with Batch Normalization, you only need to change the dimension that is normalized over. After the introduction of the Transformer Model there haven't been many changes, but one of the biggest is that Layer Normalization in the Attention Mechanism block changed from

with linear transformation

This compensates for the fact that the normalization step can limit the model's representational power.

$\gamma$ - scale

$\beta$ - shift

eps - epsilon is a small value added to the denominator during the normalization process to prevent division by zero

element wise affine - learnable scaling and shifting operations applied to each element

$\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \text{eps}}} \times \gamma + \beta$
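
The formula can be checked with a small plain-Python sketch. It assumes the biased (population) variance and a default `eps` of 1e-5, which are common conventions rather than something stated in this note. `gamma` and `beta` are per-element lists, matching the element-wise affine description.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Apply (x - mu) / sqrt(sigma^2 + eps) * gamma + beta to one vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance (assumed)
    inv_std = 1.0 / math.sqrt(var + eps)            # eps guards against var == 0
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


x = [2.0, 4.0, 6.0, 8.0]

# gamma = 1, beta = 0 gives plain normalization.
identity = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

# gamma = 2, beta = 0.5 rescales and shifts the normalized values.
shifted = layer_norm(x, gamma=[2.0] * 4, beta=[0.5] * 4)

# A constant input has zero variance; eps keeps the division finite.
constant = layer_norm([3.0] * 4, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With non-trivial `gamma` and `beta` the output no longer has zero mean and unit variance, which is why the result is not a perfectly normalized distribution.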

Layer Normalizations

RMS Normalization

DyT

https://arxiv.org/pdf/1607.06450.pdf

Experimental evidence shows that removing LayerNorm (LN) from GPT-2 models results in minimal performance loss. The research points out that LN introduces non-linearities that complicate mechanistic interpretability (Linear representation hypothesis). Since removing LN all at once breaks the model, the researchers gradually replaced it with FakeLN (fixed scale), layer by layer and path by path, with minimal fine-tuning. Removing LN eliminated Conf-Neurons (Confidence Neurons), reducing model overconfidence (low entropy was observed). Additionally, the Attention Sink phenomenon, where the L2 norm of the first token becomes excessively large, was diminished.

https://arxiv.org/pdf/2507.02559

