A per-token transformation with learnable parameters (the output is not forced into a perfect standard distribution, because the learnable scale and shift re-transform it)
BN normalizes activations across the batch dimension, while LN normalizes the activations of each layer (across the feature dimension of each token). It is usually combined with a residual connection, as sketched below.
Normalizing each layer's output stabilizes training and reduces differences in learning speed between layers.
Compared to Batch Normalization, only the normalization dimension changes. Since the Transformer model was introduced, its architecture has seen few changes, but one of the biggest is that Layer Normalization in the attention block moved from Post-Norm to Pre-Norm.
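A minimal PyTorch sketch of the two placements with a residual connection; a small MLP stands in for the attention/feed-forward sublayer, and ResidualBlock is an illustrative name, not code from any particular library:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual sublayer with LayerNorm placed either before (Pre-Norm) or after (Post-Norm)."""
    def __init__(self, d_model, pre_norm=True):
        super().__init__()
        # A small MLP stands in for the attention / feed-forward sublayer.
        self.sublayer = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln = nn.LayerNorm(d_model)
        self.pre_norm = pre_norm

    def forward(self, x):
        if self.pre_norm:
            return x + self.sublayer(self.ln(x))   # Pre-Norm: x + Sublayer(LN(x))
        return self.ln(x + self.sublayer(x))       # Post-Norm: LN(x + Sublayer(x))

x = torch.randn(2, 16, 64)                          # (batch, tokens, d_model)
print(ResidualBlock(64, pre_norm=True)(x).shape, ResidualBlock(64, pre_norm=False)(x).shape)
```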
After normalization, a learnable linear (affine) transformation is applied.
This compensates for the fact that the normalization step itself can limit the model's expressive power.
- γ - scale
- β - shift
- eps - a small value added to the denominator during normalization to prevent division by zero
- element-wise affine - learnable scaling and shifting applied to each element (see the sketch after this list)
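A minimal sketch of how these parameters fit together, assuming the PyTorch nn.LayerNorm interface (its eps and elementwise_affine arguments); ManualLayerNorm is an illustrative re-implementation, not library code:

```python
import torch
import torch.nn as nn

class ManualLayerNorm(nn.Module):
    """Illustrative re-implementation of LayerNorm over the last (feature) dimension."""
    def __init__(self, d_model, eps=1e-5, elementwise_affine=True):
        super().__init__()
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if elementwise_affine:
            self.gamma = nn.Parameter(torch.ones(d_model))   # scale
            self.beta = nn.Parameter(torch.zeros(d_model))   # shift
        else:
            self.register_parameter('gamma', None)
            self.register_parameter('beta', None)

    def forward(self, x):
        # Per-token statistics: mean/variance over the feature dimension,
        # unlike BatchNorm, which reduces over the batch dimension.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)      # eps prevents division by zero
        if self.elementwise_affine:
            x_hat = self.gamma * x_hat + self.beta           # learnable affine transform
        return x_hat

x = torch.randn(2, 16, 64)
print(torch.allclose(ManualLayerNorm(64)(x), nn.LayerNorm(64)(x), atol=1e-5))  # True
```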
Removing Layer Normalization
Experimental evidence shows that removing LayerNorm (LN) from GPT-2 models results in minimal performance loss. The research points out that LN introduces non-linearities that complicate mechanistic interpretability (it conflicts with the linear representation hypothesis). Since removing LN all at once breaks the model, the researchers gradually replaced it with FakeLN (a fixed-scale substitute), layer by layer and path by path, with minimal fine-tuning. Removing LN eliminated Confidence Neurons (Conf-Neurons), reducing the model's overconfidence (abnormally low output entropy had been observed). It also diminished the Attention Sink phenomenon, in which the L2 norm of the first token's activations becomes excessively large.
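A hedged sketch of the FakeLN idea, under the assumption that it replaces the per-token standard deviation with a fixed constant; the class name, the sigma_fixed value, and the constructor signature are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class FakeLN(nn.Module):
    """Sketch of a fixed-scale LayerNorm replacement (illustrative, not the paper's code).

    Instead of dividing by the per-token standard deviation, it divides by a fixed
    constant (e.g. an average std measured on some calibration data).
    """
    def __init__(self, ln: nn.LayerNorm, sigma_fixed: float):
        super().__init__()
        self.weight = ln.weight            # reuse the original gamma
        self.bias = ln.bias                # reuse the original beta
        self.sigma_fixed = sigma_fixed     # assumed: a precomputed constant scale

    def forward(self, x):
        x = x - x.mean(dim=-1, keepdim=True)   # centering stays (it is already linear)
        x = x / self.sigma_fixed               # fixed scale instead of per-token std
        return self.weight * x + self.bias

ln = nn.LayerNorm(64)
x = torch.randn(2, 16, 64)
print(FakeLN(ln, sigma_fixed=1.0)(x).shape)
```

Because the division no longer depends on the input, the whole map becomes affine, which is what makes the resulting model easier to analyze under the linear representation hypothesis.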