Adaptive Momentum

AdaM stands for Adaptive Momentum. It combines Momentum and RMSProp in a single approach, making AdaM a very powerful and fast optimizer.

\[ \begin{aligned}
V_{dW} &= \beta_{1} V_{dW} + \left(1-\beta_{1}\right) dW ; & V_{db} &= \beta_{1} V_{db} + \left(1-\beta_{1}\right) db \\
V_{dW}^{corrected} &= \frac{V_{dW}}{1-\beta_{1}^{t}} ; & V_{db}^{corrected} &= \frac{V_{db}}{1-\beta_{1}^{t}} \\
S_{dW} &= \beta_{2} S_{dW} + \left(1-\beta_{2}\right) dW^{2} ; & S_{db} &= \beta_{2} S_{db} + \left(1-\beta_{2}\right) db^{2} \\
S_{dW}^{corrected} &= \frac{S_{dW}}{1-\beta_{2}^{t}} ; & S_{db}^{corrected} &= \frac{S_{db}}{1-\beta_{2}^{t}} \\
W &= W - \alpha \cdot \frac{V_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}} + \epsilon} \\
b &= b - \alpha \cdot \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}} + \epsilon}
\end{aligned} \]

At the starting point, VdW, SdW, Vdb, and Sdb are initialized at zero. At each iteration t we compute the derivatives dW and db using the current mini-batch. We then calculate the exponentially weighted averages VdW and Vdb (the Momentum part) and SdW and Sdb (the RMSProp part). In the typical AdaM implementation, the corrected values Vcorrected and Scorrected are then computed to remove the initialization bias for both the weights and the biases. At the end, the weight and bias update is performed. In practice, this algorithm combines the effect of gradient descent with momentum and gradient descent with root mean square propagation (a minimal code sketch is given after the list below). This algorithm has a variety of hyperparameters:
1 - the learning rate alpha needs to be tuned
2 - the default choice for the momentum parameter beta1 is 0.9
3 - the default choice for the AdaM parameter beta2 is 0.999
4 - the recommended choice for epsilon is 10^-8
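
The update equations above can be written as a short NumPy routine. The sketch below is illustrative, not taken from any particular library: the names adam_update and init_adam_state, and the use of a state dictionary for VdW, Vdb, SdW, and Sdb, are assumptions; it considers a single weight matrix W and bias vector b, with gradients dW and db coming from the current mini-batch.

```python
import numpy as np

def init_adam_state(W, b):
    # VdW, Vdb, SdW, Sdb are initialized at zero
    return {"VdW": np.zeros_like(W), "Vdb": np.zeros_like(b),
            "SdW": np.zeros_like(W), "Sdb": np.zeros_like(b)}

def adam_update(W, b, dW, db, state, t,
                alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponentially weighted average of the gradients (Momentum part)
    state["VdW"] = beta1 * state["VdW"] + (1 - beta1) * dW
    state["Vdb"] = beta1 * state["Vdb"] + (1 - beta1) * db

    # Second moment: exponentially weighted average of the squared gradients (RMSProp part)
    state["SdW"] = beta2 * state["SdW"] + (1 - beta2) * dW ** 2
    state["Sdb"] = beta2 * state["Sdb"] + (1 - beta2) * db ** 2

    # Bias correction compensating for the zero initialization at small t
    VdW_corr = state["VdW"] / (1 - beta1 ** t)
    Vdb_corr = state["Vdb"] / (1 - beta1 ** t)
    SdW_corr = state["SdW"] / (1 - beta2 ** t)
    Sdb_corr = state["Sdb"] / (1 - beta2 ** t)

    # Weight and bias update
    W = W - alpha * VdW_corr / (np.sqrt(SdW_corr) + eps)
    b = b - alpha * Vdb_corr / (np.sqrt(Sdb_corr) + eps)
    return W, b
```

Calling init_adam_state once before training and adam_update on every mini-batch, with the iteration counter t starting at 1, reproduces the update equations above.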

AdaM means Adaptive Moment Estimation: the term governed by beta1 computes an exponentially weighted mean of the derivatives dW, which is called the first moment, while the term governed by beta2 computes an exponentially weighted mean of dW^2, which is called the second moment.
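
As a quick illustration of why the bias correction matters (using the default beta1 = 0.9 from the list above): at the first iteration the zero-initialized estimate captures only a small fraction of the gradient, and dividing by \(1-\beta_{1}^{t}\) rescales it back:

\[ t=1:\quad V_{dW} = \beta_{1}\cdot 0 + (1-\beta_{1})\, dW = 0.1\, dW, \qquad V_{dW}^{corrected} = \frac{0.1\, dW}{1-\beta_{1}^{1}} = \frac{0.1\, dW}{0.1} = dW \]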

Reference: Coursera deep neural network course