Vanishing Gradient

Creator: Seonglae Cho
Created: 2023 May 23 1:47
Edited: 2024 Nov 25 15:39

Due to the repeated multiplication of weights, the Spectral radius of the weight matrix compounds layer after layer, since the gradient is repeatedly scaled along that direction. Gradient information should pass through the network at a sufficient magnitude: not too much (Exploding gradient), not too little (Vanishing Gradient).
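A minimal NumPy sketch of this point (illustrative, not from the note): backpropagation through a stack of identical linear layers multiplies the gradient by Wᵀ at every step, so the spectral radius of W decides whether the gradient shrinks or blows up. The `spectral_radius`, `depth`, and `dim` values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_grad_norm(spectral_radius: float, depth: int = 50, dim: int = 64) -> float:
    W = rng.standard_normal((dim, dim))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))  # rescale W to the given spectral radius
    grad = rng.standard_normal(dim)                         # gradient arriving at the last layer
    for _ in range(depth):
        grad = W.T @ grad                                   # one backprop step through a linear layer
    return float(np.linalg.norm(grad))

print("rho = 0.9:", final_grad_norm(0.9))   # decays toward zero -> vanishing gradient
print("rho = 1.1:", final_grad_norm(1.1))   # grows rapidly -> exploding gradient
```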
It usually arises from non-linear components, although deep stacks of purely linear layers are also a problem.
Stacking more layers increases the model's representational power, so training might be expected to improve; in practice, however, training degrades as layers are added. The gradient becomes very small in layers far from the output layer, and this happens when the gradient of the Activation Function is much smaller than the activation's actual value.
It was first mitigated with the Tanh Function and later resolved with functions such as ReLU that have the Non-saturating nonlinearity property.
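A minimal sketch of the same effect with activations (a hypothetical toy network, with width, depth, and He-style init chosen only for illustration): the sigmoid derivative is at most 0.25, so the gradient norm collapses over many layers, while ReLU's derivative is 0 or 1 and keeps the gradient at a usable scale.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 30, 64
# He-style init (illustrative assumption) so ReLU keeps activations at roughly unit scale
Ws = [rng.standard_normal((dim, dim)) * np.sqrt(2.0 / dim) for _ in range(depth)]

def first_layer_grad_norm(activation: str) -> float:
    # Forward pass: remember each pre-activation z for the backward pass
    x = rng.standard_normal(dim)
    zs = []
    for W in Ws:
        z = W @ x
        zs.append(z)
        x = 1.0 / (1.0 + np.exp(-z)) if activation == "sigmoid" else np.maximum(z, 0.0)
    # Backward pass: chain rule grad_{x_{l-1}} = W^T (grad_{x_l} * act'(z_l))
    grad = rng.standard_normal(dim)               # gradient arriving at the output
    for W, z in zip(reversed(Ws), reversed(zs)):
        if activation == "sigmoid":
            s = 1.0 / (1.0 + np.exp(-z))
            grad = grad * s * (1.0 - s)           # sigmoid derivative <= 0.25: saturating
        else:
            grad = grad * (z > 0)                 # ReLU derivative is 0 or 1: non-saturating
        grad = W.T @ grad
    return float(np.linalg.norm(grad))            # gradient norm reaching the first layer

print("sigmoid:", first_layer_grad_norm("sigmoid"))  # tiny -> vanishing gradient
print("relu   :", first_layer_grad_norm("relu"))     # stays at a usable scale
```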
Vanishing gradients are desirable to some extent, since it is reasonable to assume that information near the current timestep is more useful than information from far away. Vanishing gradients are therefore acceptable when the distant information is not relevant.