$\omega^{[l]}=\left[\begin{array}{cc}{1.5} & {0} \\ {0} & {1.5}\end{array}\right] \quad \hat{y}=\omega^{[L]}\left[\begin{array}{cc}{1.5} & {0} \\ {0} & {1.5}\end{array}\right]^{L-1} x$ With a weight of 1.5, the estimate of y tends to explode along the neural network; conversely, with a weight of 0.5 it tends to vanish. In the formula above, L is the number of layers, so when the number of hidden layers is large the estimate of y explodes when the weights are greater than 1 and vanishes when they are smaller than 1. Especially when the weights are too small, gradient descent takes a long time to learn.
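A minimal numerical sketch of this effect, assuming (for illustration) a deep linear network of 2-unit layers whose weight matrices are all the scaled identity from the formula above, with no activation function:

```python
import numpy as np

def forward_linear(x, w_scale, n_layers):
    """Propagate x through n_layers identical linear layers W = w_scale * I."""
    W = w_scale * np.eye(2)
    a = x
    for _ in range(n_layers):
        a = W @ a
    return a

x = np.array([1.0, 1.0])
print(forward_linear(x, 1.5, 50))  # explodes: 1.5**50 ≈ 6.4e8
print(forward_linear(x, 0.5, 50))  # vanishes: 0.5**50 ≈ 8.9e-16
```

With only 50 layers, the activations (and by the same argument the gradients) already differ from the input by roughly sixteen orders of magnitude in either direction.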
$\begin{array}{l}{z=\omega_{1} x_{1}+\omega_{2} x_{2}+\cdots+\omega_{n} x_{n}} \\ {\text{large } n \rightarrow \text{smaller } \omega_{i}} \\ {\operatorname{Var}\left(\omega_{i}\right)=\frac{1}{n}}\end{array}$ The solution is given by the formula above: since z sums n weighted inputs, a large number of input features n calls for smaller weights, so we initialize each weight $\omega_{i}$ with variance 1/n, where n is the number of input features. If we use a ReLU activation function, a variance of 2/n is used instead, where n again is the number of input features. When we use the tanh activation function, Xavier initialization is used, which scales the weights by sqrt(1/n).
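A minimal NumPy sketch of these initialization rules; the function name init_weights and the layer sizes are illustrative assumptions, not from the original:

```python
import numpy as np

def init_weights(n_in, n_out, activation="relu", seed=0):
    """Variance-scaled initialization for one layer (illustrative sketch).

    relu -> Var(w) = 2/n_in (He initialization)
    tanh -> scale = sqrt(1/n_in) (Xavier initialization)
    """
    rng = np.random.default_rng(seed)
    if activation == "relu":
        scale = np.sqrt(2.0 / n_in)
    else:
        scale = np.sqrt(1.0 / n_in)
    return rng.standard_normal((n_out, n_in)) * scale

W1 = init_weights(784, 128, activation="relu")
print(W1.var())  # ≈ 2/784 ≈ 0.00255
```

Scaling the variance by 1/n (or 2/n) keeps the variance of z roughly constant from layer to layer, which is exactly what prevents the exponential growth or decay described above.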