Vanishing Gradient

Creator: Seonglae Cho
Created: 2023 May 23 1:47
Edited: 2025 Nov 11 22:39

Due to the repeated multiplication of weights, the effect of the Spectral radius of the weight matrix accumulates, since the gradient accumulates in this direction.
When the Eigenvalue is larger than 1, an Exploding gradient occurs, while a Vanishing Gradient happens when the eigenvalue is smaller than 1.
Gradient information needs to be passed sufficiently through the network: not too much (Exploding gradient), not too little (Vanishing Gradient).
Exploding gradient and Vanishing Gradient typically occur due to non-linear components, though deep stacks of linear transformations can also be problematic.
While stacking more layers increases representational capacity and should improve learning, in practice deeper networks often train poorly. This is the vanishing gradient phenomenon, where gradient values become extremely small as they propagate away from the output layer. It occurs when the gradient of the Activation Function is much smaller than the activation's actual value. The problem was initially mitigated using the Tanh Function, then largely solved by adopting Non-saturating nonlinearity functions like ReLU.
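The saturation argument can be sketched with assumed toy numbers: tanh'(x) = 1 - tanh(x)² is below 1 everywhere away from x = 0, so its product across many layers drives the gradient toward zero, while ReLU'(x) = 1 for active units and passes the gradient through unchanged.

```python
import numpy as np

depth = 20
pre_act = 2.0                                 # assumed typical pre-activation

tanh_grad = 1 - np.tanh(pre_act) ** 2         # ~0.07: saturating regime
relu_grad = 1.0                               # non-saturating (active unit)

tanh_total = tanh_grad ** depth               # product over 20 layers: ~1e-23
relu_total = relu_grad ** depth               # stays exactly 1
```

Even 20 layers at a moderate pre-activation of 2.0 shrink the tanh gradient by about twenty orders of magnitude, which is why non-saturating nonlinearities help.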
Vanishing gradients are desirable to some extent, as it is reasonable to assume that information near the current timestep is more useful than information far away. Therefore, vanishing gradients are acceptable when the distant information is not relevant.

Ordered -
Vanishing Gradient

Chaotic -
Exploding gradient

Edge of Chaos

Therefore, when performing Weight Initialization, choosing the scale appropriately ensures that gradients propagate stably, latent representations are both expressive and stable, and the network sits in the critical learning regime (edge of chaos).
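One concrete scheme is He initialization, sketched below with illustrative numbers: drawing weights with variance 2/fan_in lets each ReLU layer preserve activation variance, so signal (and hence gradient) scale stays stable with depth instead of collapsing.

```python
import numpy as np

# Sketch of variance-preserving (He) initialization for ReLU stacks.
rng = np.random.default_rng(0)

def final_activation_norm(std, depth=30, dim=256):
    h = rng.standard_normal(dim)
    for _ in range(depth):
        # W entries have std `std / sqrt(dim)`, i.e. Var(W) = std^2 / fan_in
        W = rng.standard_normal((dim, dim)) * std / np.sqrt(dim)
        h = np.maximum(W @ h, 0.0)           # ReLU layer
    return np.linalg.norm(h)

stable = final_activation_norm(np.sqrt(2.0))  # He scale: norm stays near input scale
collapsed = final_activation_norm(0.5)        # too small: activations vanish
```

The factor √2 exactly compensates for ReLU zeroing out half the variance on average, which is the sense in which the initialization keeps the network at the edge of chaos rather than in the ordered (vanishing) or chaotic (exploding) regime.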