Adam Optimization Algorithm
Adam stands for Adaptive Moment Estimation.
To follow this post, you should first understand gradient descent with momentum and RMSprop.
Organized and translated from Andrew Ng's deep learning video series: https://mooc.study.163.com/learn/2001281003?tid=2001391036#/learn/content?type=detail&id=2001701052&cid=2001694315

RMSprop and the Adam optimization algorithm are among those rare algorithms that have really stood up and been shown to work well across a wide range of deep learning architectures.
The basic idea of Adam is to take gradient descent with momentum and RMSprop and put them together.
Momentum part:
$v_{dw}=\beta_1 v_{dw}+(1-\beta_1)dW$ (an exponentially weighted average of the gradients; same below)
$v_{db}=\beta_1 v_{db}+(1-\beta_1)db$
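A minimal NumPy sketch of this accumulator (the gradient array `dW` and the repeated-gradient loop are hypothetical stand-ins; in practice `dW` comes from backprop on each mini-batch):

```python
import numpy as np

beta1 = 0.9                          # typical value for the momentum decay rate
dW = np.array([0.1, -0.2, 0.3])      # hypothetical gradient, for illustration only

v_dw = np.zeros_like(dW)             # the accumulator starts at zero
for t in range(1, 4):                # three steps with the same gradient
    v_dw = beta1 * v_dw + (1 - beta1) * dW
    print(t, v_dw)                   # warms up slowly toward dW (hence the bias correction below)
```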
RMSprop part:
$S_{dw}=\beta_2 S_{dw}+(1-\beta_2)dW^2$ <- element-wise (an exponentially weighted average of the squared gradients; same below)
$S_{db}=\beta_2 S_{db}+(1-\beta_2)db^2$ <- element-wise
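The same kind of sketch for the squared version (again with a hypothetical `dW`); note that `dW**2` in NumPy squares element-wise, matching the note above:

```python
import numpy as np

beta2 = 0.999                        # typical value for the RMSprop decay rate
dW = np.array([0.1, -0.2, 0.3])      # hypothetical gradient

S_dw = np.zeros_like(dW)             # second-moment accumulator starts at zero
S_dw = beta2 * S_dw + (1 - beta2) * dW**2   # element-wise square of the gradient
print(S_dw)                          # [1.e-05 4.e-05 9.e-05]: tiny, again biased toward zero
```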
Bias correction for the early iterations:
$v_{dw}^{correct}=\frac{v_{dw}}{1-\beta_1^t}$
$v_{db}^{correct}=\frac{v_{db}}{1-\beta_1^t}$
$S_{dw}^{correct}=\frac{S_{dw}}{1-\beta_2^t}$
$S_{db}^{correct}=\frac{S_{db}}{1-\beta_2^t}$
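To see why this matters, here is a sketch of the first iteration (t = 1), where the zero-initialized averages are strongly biased toward zero (the gradient values are hypothetical):

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
dW = np.array([0.1, -0.2, 0.3])      # hypothetical gradient

v_dw = (1 - beta1) * dW              # first step from zero: only 10% of dW
S_dw = (1 - beta2) * dW**2           # likewise, only 0.1% of dW^2

t = 1
print(v_dw / (1 - beta1**t))         # recovers dW exactly at t = 1
print(S_dw / (1 - beta2**t))         # recovers dW^2 exactly at t = 1
```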
The parameter update becomes:
$W = W-\alpha \frac{v_{dw}^{correct}}{\sqrt{S_{dw}^{correct}}+\epsilon}$ (the numerator comes from momentum, the denominator from RMSprop; same below)
$b = b-\alpha \frac{v_{db}^{correct}}{\sqrt{S_{db}^{correct}}+\epsilon}$
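Putting the four pieces together, here is a minimal sketch of Adam in NumPy (the function name `adam_update` and the toy quadratic loss are my own; beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8 are the defaults recommended in the course, while alpha normally needs tuning):

```python
import numpy as np

def adam_update(W, dW, v, S, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a parameter array W with gradient dW.

    v and S are the running first- and second-moment estimates;
    t is the 1-based iteration count used by the bias correction.
    """
    v = beta1 * v + (1 - beta1) * dW              # momentum part
    S = beta2 * S + (1 - beta2) * dW**2           # RMSprop part (element-wise square)
    v_corrected = v / (1 - beta1**t)              # bias correction
    S_corrected = S / (1 - beta2**t)
    W = W - alpha * v_corrected / (np.sqrt(S_corrected) + eps)
    return W, v, S

# Toy usage: minimize L(W) = ||W||^2 / 2, whose gradient is simply dW = W.
W = np.array([1.0, -2.0])
v, S = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 301):
    W, v, S = adam_update(W, W, v, S, t)          # dW = W for this toy loss
print(W)                                          # both entries end up near 0
```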
Adam = adaptive moment estimation:
$\beta_1$ is used to compute the mean of the derivatives (via an exponentially weighted average); this is called the first moment.
$\beta_2$ is used to compute the exponentially weighted average of the squared derivatives; this is called the second moment.
That is where the name Adam comes from, and people generally just call it the Adam optimization algorithm.