Due to repeated multiplication by the weight matrix during backpropagation, the effect of its spectral radius compounds with depth, since the gradient accumulates a product of these matrices along the chain.
Exploding Gradient occurs when the largest eigenvalue (the spectral radius) is larger than 1, while Vanishing Gradient occurs when it is smaller than 1.
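This multiplicative picture can be sketched numerically. Below is a toy example (not from the note): `gradient_norm` is a hypothetical helper that uses W = scale·I, so the spectral radius is exactly `scale` and the gradient norm scales by exactly `scale**depth`.

```python
import numpy as np

def gradient_norm(scale: float, depth: int) -> float:
    """Norm of a gradient vector after `depth` multiplications by W,
    where W = scale * I, so the spectral radius of W is exactly `scale`."""
    rng = np.random.default_rng(0)
    g = rng.normal(size=4)          # initial gradient at the output layer
    W = scale * np.eye(4)           # toy weight matrix with spectral radius `scale`
    for _ in range(depth):
        g = W @ g                   # one backprop step through a linear layer
    return float(np.linalg.norm(g))

# Relative to scale = 1, the norm changes by exactly scale**depth:
print(gradient_norm(1.1, 50))  # grows by a factor of 1.1**50 (explodes)
print(gradient_norm(0.9, 50))  # shrinks by a factor of 0.9**50 (vanishes)
```

With `scale` just above or below 1, fifty layers are already enough to blow up or wipe out the gradient.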
Gradient information should pass through the network in the right measure: not too much (Exploding Gradient), not too little (Vanishing Gradient).
Exploding Gradient and Vanishing Gradient typically arise from non-linear components, though deep stacks of linear transformations can also be problematic.
While stacking more layers increases representational capacity and should improve learning, in practice deeper networks often train poorly. This is the vanishing gradient phenomenon: gradient values become extremely small as they propagate away from the output layer. It occurs when the derivative of the Activation Function is much smaller than 1, so the product of derivatives across layers shrinks exponentially. The problem was initially mitigated using the Tanh Function, then largely solved by adopting Non-saturating nonlinearity functions like ReLU.
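The saturating vs. non-saturating contrast can be illustrated with a toy sketch (assumed setup, not from the note): the gradient through a chain of activations is a product of their derivatives, so bounded derivatives like tanh's shrink it exponentially, while ReLU's derivative is exactly 0 or 1.

```python
import numpy as np

def grad_factor_product(act_deriv, depth: int, seed: int = 0) -> float:
    """Product of activation derivatives along a depth-`depth` chain,
    evaluated at random pre-activations (toy model of backprop)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(scale=2.0, size=depth)   # hypothetical pre-activations
    return float(np.prod(act_deriv(z)))

tanh_deriv = lambda z: 1.0 - np.tanh(z) ** 2       # always <= 1, << 1 when saturated
relu_deriv = lambda z: (z > 0).astype(float)       # exactly 0 (inactive) or 1 (active)

print(grad_factor_product(tanh_deriv, 30))  # tiny: saturating derivatives compound
print(grad_factor_product(relu_deriv, 30))  # 0 if any unit is inactive, else exactly 1
```

The point is not that ReLU is always better (a dead path passes no gradient at all), but that along active paths ReLU passes the gradient through unchanged instead of shrinking it at every layer.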
Vanishing gradients are even desirable to some extent: it is reasonable to assume that information from nearby timesteps is more useful than information from far away, so vanishing gradients are acceptable when the distant information is not relevant.
Ordered phase - Vanishing Gradient
Chaotic phase - Exploding Gradient
Edge of Chaos - the critical regime between the two
Therefore, when performing Weight Initialization, choosing the right weight scale ensures that gradients propagate stably, latent representations are both expressive and stable, and the network sits in the critical learning regime (the edge of chaos).
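As a rough illustration of the initialization scale mattering, here is a toy sketch (assumed setup, not from the note) that pushes a signal through a deep tanh network: with weights scaled well below 1/sqrt(width) the activations decay toward zero (ordered phase), while a Xavier-like scale of 1/sqrt(width) keeps them alive much longer.

```python
import numpy as np

def layer_output_std(init_std: float, depth: int, width: int = 256, seed: int = 0) -> float:
    """Std of activations after `depth` fully-connected tanh layers,
    with weights drawn i.i.d. from N(0, init_std^2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=width)                          # input signal
    for _ in range(depth):
        W = rng.normal(scale=init_std, size=(width, width))
        x = np.tanh(W @ x)                              # forward through one layer
    return float(x.std())

width = 256
sub_critical = layer_output_std(0.5 / np.sqrt(width), depth=20)   # ordered: signal dies out
near_critical = layer_output_std(1.0 / np.sqrt(width), depth=20)  # Xavier-like scale, near the edge of chaos
print(sub_critical, near_critical)
```

The same variance argument applied to the backward pass is why Xavier/He-style initializations help gradients survive depth.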

Seonglae Cho