Layer Normalization

Creator: Seonglae Cho
Created: 2019 Nov 19 7:51
Edited: 2025 Dec 22 18:25

A per-token transformation with learnable parameters, so the output is not a perfectly standardized distribution.

BN normalizes activations across the batch, while LN normalizes the activations of each layer (each token's feature vector). It is usually combined with a Residual Connection, as below: normalizing a layer's output stabilizes training and reduces differences in effective learning rates between layers. Compared to Batch Normalization, only the dimension over which statistics are computed changes. Since the introduction of the Transformer Model the architecture has changed little, but one of the biggest changes is that the Layer Normalization in the Attention Mechanism block moved from Post-Norm to Pre-Norm.
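A minimal PyTorch sketch (not from the original page; module and variable names are illustrative) of the dimension difference between BatchNorm and LayerNorm, and of the Pre-Norm residual pattern:

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 8, 16, 64
x = torch.randn(batch, seq_len, d_model)

# LayerNorm: statistics over the last (feature) dimension, independently per token
ln = nn.LayerNorm(d_model)
# BatchNorm1d: statistics over the batch (and sequence) positions, per feature channel
bn = nn.BatchNorm1d(d_model)

y_ln = ln(x)                                  # (batch, seq, d_model)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects (batch, channels, length)

# Pre-Norm residual block: normalize *before* the sub-layer, then add the residual.
# Post-Norm would instead compute: x = norm(x + sublayer(x))
class PreNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer  # e.g. attention or an MLP

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

block = PreNormBlock(d_model, nn.Linear(d_model, d_model))
print(block(x).shape)  # torch.Size([8, 16, 64])
```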
 

With a learnable linear (affine) transformation

To compensate for the fact that normalization can limit the model's expressive power, LN applies a learnable element-wise affine transformation (see the formula below):
  • γ (gamma) - scale
  • β (beta) - shift
  • eps - a small value added to the denominator during normalization to prevent division by zero
  • element-wise affine - learnable scaling and shifting operations applied to each element
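A standard formulation (matching, for example, the one documented for PyTorch's nn.LayerNorm), where the mean and variance are computed over the normalized feature dimensions:

$$
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta
$$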
Layer Normalizations
Experimental evidence shows that removing LayerNorm (LN) from GPT-2 models results in minimal performance loss. The research points out that LN introduces non-linearities that complicate mechanistic interpretability (see the Linear Representation Hypothesis). Since removing LN all at once breaks the model, researchers gradually replaced it with FakeLN (a fixed-scale variant), layer by layer and path by path, with minimal fine-tuning. Removing LN eliminated Conf-Neurons (Confidence Neurons), reducing the model's overconfidence (previously visible as low output entropy). It also diminished the Attention Sink phenomenon, in which the L2 norm of the first token's activations becomes excessively large.
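A minimal sketch of the fixed-scale idea; the name FakeLN comes from the source, but the implementation details below (keeping centering, estimating a constant standard deviation beforehand) are assumptions, not the research's exact method:

```python
import torch
import torch.nn as nn

class FakeLN(nn.Module):
    """LayerNorm with the per-input std replaced by a fixed constant.

    Centering and rescaling by a constant are linear maps, so the non-linearity
    that ordinary LayerNorm introduces disappears. `fixed_std` would be
    estimated from activation statistics before swapping the module in.
    """
    def __init__(self, normalized_shape, fixed_std: float, eps: float = 1e-5):
        super().__init__()
        self.fixed_std = fixed_std
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))

    def forward(self, x):
        x = x - x.mean(dim=-1, keepdim=True)   # centering (linear)
        x = x / (self.fixed_std + self.eps)    # fixed scale, not the per-input std
        return x * self.weight + self.bias     # usual element-wise affine

# Usage: swap in for an existing nn.LayerNorm one layer at a time, then fine-tune briefly.
ln = nn.LayerNorm(64)
fake = FakeLN(64, fixed_std=1.0)  # fixed_std=1.0 is a placeholder value
```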
 

Recommendations