Vanishing Gradient

One of the problems of training a deep neural network is the vanishing and exploding gradient: as the network gets deeper, the derivatives (slopes) can grow exponentially large or shrink exponentially small, and this makes training difficult. We have to choose the random weight initialization carefully in order to mitigate this problem.

\[ \omega^{[l]}=\left[\begin{array}{cc}{1.5} & {0} \\ {0} & {1.5}\end{array}\right] \quad \hat{y}=\omega^{[L]}\left[\begin{array}{cc}{1.5} & {0} \\ {0} & {1.5}\end{array}\right]^{L-1} x \]

If every layer has weights of 1.5, the estimate of y explodes as it propagates through the network; conversely, if the weights are 0.5, the estimate of y vanishes. In the formula above L is the number of layers, so when the network has many hidden layers the y estimate explodes when the weights are large and vanishes when the weights are small. Especially when the weights are too small, gradient descent will take a long time to learn.
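To make the scaling effect concrete, here is a minimal numpy sketch (my own illustration, not code from the course) that pushes an input through L identical linear layers and prints the resulting magnitudes for weights of 1.5 and 0.5.

```python
import numpy as np

# Assumed setup: a purely linear deep network with L identical 2x2 weight
# matrices, to isolate how the output scale changes exponentially with depth.
def forward(weight_value, num_layers, x):
    W = np.array([[weight_value, 0.0],
                  [0.0, weight_value]])
    y_hat = x
    for _ in range(num_layers):
        y_hat = W @ y_hat  # no activation, just repeated scaling by W
    return y_hat

x = np.array([1.0, 1.0])
print(forward(1.5, 50, x))  # entries ~ 1.5**50 ≈ 6.4e8   -> explodes
print(forward(0.5, 50, x))  # entries ~ 0.5**50 ≈ 8.9e-16 -> vanishes
```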

A partial solution is to initialize the weights of a single neuron correctly and then generalize that rule to the rest of the deep network.

\[ \begin{array}{l}{z=\omega_{1} x_{1}+\omega_{2} x_{2}+\cdots+\omega_{n} x_{n}} \\ {\text{large } n \rightarrow \text{smaller } \omega_{i}} \\ {\operatorname{Var}\left(\omega_{i}\right)=\frac{1}{n}}\end{array} \]

The solution is given by the formula above: when a neuron has many inputs, we set the variance of each weight \(\omega_i\) to 1/n, where n is the number of input features. If we use a ReLU activation function, 2/n is used instead (again with n the number of input features). With the tanh activation function, Xavier initialization is used, which scales the weights by sqrt(1/n).
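As an illustration, the sketch below (an assumed `initialize_layer` helper with hypothetical layer sizes, not the course's code) draws weights with the variance rules above: 2/n for ReLU and 1/n (Xavier) for tanh.

```python
import numpy as np

# Sketch of the initialization rules above. n_in is the number of input
# features feeding the layer; the scale sets Var(w_i) as described.
def initialize_layer(n_in, n_out, activation="relu"):
    if activation == "relu":
        scale = np.sqrt(2.0 / n_in)   # Var(w_i) = 2/n for ReLU
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_in)   # Xavier: Var(w_i) = 1/n for tanh
    else:
        scale = np.sqrt(1.0 / n_in)   # default to 1/n
    W = np.random.randn(n_out, n_in) * scale
    b = np.zeros((n_out, 1))
    return W, b

# Example with hypothetical sizes: 784 input features, 128 hidden units.
W1, b1 = initialize_layer(n_in=784, n_out=128, activation="relu")
print(W1.std())  # close to sqrt(2/784) ≈ 0.05
```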

Reference: Coursera deep neural network course