牛顿法和拟牛顿法是求解无约束最优化的常用方法,有收敛速度快的优点. 牛顿法属于迭代算法,每一步需要求解目标函数的海赛矩阵的逆矩阵,计算复杂. 拟牛顿法通过正定矩阵近似海赛矩阵的逆矩阵,简化了这个过程.
对于无约束优化 minx∈Rnf(x) \min_{x\in R^n} f(x) x∈Rnminf(x) x∗x^*x∗是目标的极小值点.
假设f(x)f(x)f(x)有二阶连续偏导数,第k次迭代值为x(k)x^(k)x(k),将f(x)f(x)f(x)在x(k)x^(k)x(k)附近进行二阶展开: f(x)=f(x(k))+f′(x(k))Δx+12f′′(x(k)Δx2=f(x(k))+f′(x(k))(x−x(k))+12f′′(x(k))(x−x(k))2 \begin{aligned} f(x) &= f(x^{(k)}) + f'(x^{(k)})\Delta x + \frac{1}{2} f''(x^{(k)}\Delta x^2\\ & = f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) + \frac{1}{2} f''(x^{(k)})(x - x^{(k)})^2 \end{aligned} f(x)=f(x(k))+f′(x(k))Δx+21f′′(x(k)Δx2=f(x(k))+f′(x(k))(x−x(k))+21f′′(x(k))(x−x(k))2 当这里的xxx是高维时 f(x)=f(x(k))+gkT(x−x(k))+12(x−x(k))TH(x(k))(x−x(k)) f(x) = f(x^{(k)}) +g_k^T(x-x^{(k)}) + \frac{1}{2}(x-x^{(k)})^TH(x^{(k)})(x-x^{(k)}) f(x)=f(x(k))+gkT(x−x(k))+21(x−x(k))TH(x(k))(x−x(k)) 其中,gk=g(x(k))=▽f(x(k))g_k = g(x^{(k)}) = \bigtriangledown f(x^{(k)})gk=g(x(k))=▽f(x(k))是f(x)f(x)f(x)的梯度向量在点x(k)x^{(k)}x(k)的值,H(x(k))H(x^{(k)})H(x(k))是f(x)f(x)f(x)的海赛矩阵在点x(k)x^{(k)}x(k)的值.
海赛矩阵式二阶导数矩阵 H(x)=[∂2f∂xi∂xj]n×n H(x) = \Big[\frac{\partial^2f}{\partial x_i\partial x_j}\Big]_{n\times n} H(x)=[∂xi∂xj∂2f]n×n
f(x)f(x)f(x)有极值的必要条件是在极值点处一阶导数为0,即梯度向量为0. 当H(x(k))H(x^{(k)})H(x(k))为正定矩阵时,f(x)f(x)f(x)的极值为极小值.
假设▽f(x(k+1))=0\bigtriangledown f(x^{(k+1)}) = 0▽f(x(k+1))=0,我们有 ▽f(x)=f′(x)+f′′(x)(x−x(k)) \bigtriangledown f(x) = f'(x) + f''(x)(x - x^{(k)}) ▽f(x)=f′(x)+f′′(x)(x−x(k)) (1.1)▽f(x)=gk+Hk(x−x(k)) \bigtriangledown f(x) = g_k + H_k(x-x^{(k)})\tag{1.1} ▽f(x)=gk+Hk(x−x(k))(1.1) 所以 gk+Hk(x−x(k))=0x(k+1)=x(k)−Hk−1gk g_k + H_k(x-x^{(k)}) = 0\\ x^{(k+1)} = x^{(k)} - H_k^{-1}g_k\\ gk+Hk(x−x(k))=0x(k+1)=x(k)−Hk−1gk 我们假设 Hkpk=−gk H_kp_k = -g_k Hkpk=−gk 则有 x(k+1)=x(k)+pk x^{(k+1)} = x^{(k)} + p_k x(k+1)=x(k)+pk
输入:目标函数f(x)f(x)f(x),梯度g(x)=▽f(x)g(x) = \bigtriangledown f(x)g(x)=▽f(x),海赛矩阵H(x)H(x)H(x),精度要求ϵ\epsilonϵ.
输出:f(x)f(x)f(x)的极小值点x∗x^*x∗.
基本思想:
海赛矩阵的逆矩阵计算困难,考虑用一个n阶矩阵Gk=G(x(k))G_k = G(x^{(k)})Gk=G(x(k))来近似替代Hk−1H_k^{-1}Hk−1.
在(1.1)(1.1)(1.1)中代入x=x(k+1)x = x^{(k+1)}x=x(k+1),即有如下: ▽f(x(k+1))=gk+Hk(x(k+1)−x(k))gk+1−gk=Hk(x(k+1)−x(k)) \bigtriangledown f(x^{(k+1)}) = g_k + H_k(x^{(k+1)}-x^{(k)})\\ g_{k+1} - g_k = H_k(x^{(k+1)} - x^{(k)}) ▽f(x(k+1))=gk+Hk(x(k+1)−x(k))gk+1−gk=Hk(x(k+1)−x(k))
记yk=g(k+1)−gk,δk=x(x+1)−x(k)y_k = g_{(k+1)} - g_k, \delta_k = x^{(x+1)} - x^{(k)}yk=g(k+1)−gk,δk=x(x+1)−x(k),则有 (2.1)yk=H)kδkHk−1yk=δk y_k = H)k\delta_k\tag{2.1}\\ H_k^{-1}y_k = \delta_k yk=H)kδkHk−1yk=δk(2.1) 我们称(2.1)(2.1)(2.1)为拟牛顿条件.
如果HkH_kHk是正定的,那么可以保证牛顿法搜索方向pkp_kpk是下降方向:
因为搜索方向是pk=−λgkp_k = -\lambda g_kpk=−λgk x=x(k)+λpk=x(k)−λHk(−1)gk x = x^{(k)} + \lambda p_k = x^{(k)} - \lambda H_k^{(-1)}g_k x=x(k)+λpk=x(k)−λHk(−1)gk
所以f(x)f(x)f(x)在x(k)x^{(k)}x(k)的泰勒展开式可以近似写成: f(x)=f(x(k))−λgkTHk−1gk f(x) = f(x^{(k)}) - \lambda g_k^TH_k^{-1}g_k f(x)=f(x(k))−λgkTHk−1gk 由于Hk−1H_k^{-1}Hk−1正定,所以有gkTHk−1gk>0g_k^TH_k^{-1}g_k>0gkTHk−1gk>0. 当λ\lambdaλ为一个充分小的正数时,总有f(x)<f(x(k))f(x)<f(x^{(k)})f(x)<f(x(k)),也就是说pkp_kpk是下降方向.
拟牛顿法将GkG_kGk作为Hk−1H_k^{-1}Hk−1的近似,要求矩阵GkG_kGk满足同样的条件,每次迭代矩阵GkG_kGk都是正定的,且GkG_kGk要满足拟牛顿条件: Gk1yk=δkG_{k_1}y_k = \delta_kGk1yk=δk
按照拟牛顿条件选择GkG_kGk作为Hk−1H_k^{-1}Hk−1的近似或选择BkB_kBk作为HkH_kHk的近似的算法称为拟牛顿法.
按照拟牛顿条件,每次迭代都可以更新矩阵: Gk+1=Gk+ΔGkG_{k+1} = G_k +\Delta G_kGk+1=Gk+ΔGk
有多种具体实现方法
DFP选择Gk+1G_{k+1}Gk+1的方法是,假设每一次迭代Gk+1G_{k+1}Gk+1都是由GkG_kGk加上两个附加项构成,我们假设这两个附加项分别为Qk、PkQ_k、P_kQk、Pk,则有 Gk+1=Gk+Qk+PkGk+1yk=Gkyk+Qkyk+Pkyk G_{k+1} = G_k + Q_k + P_k\\ G_{k+1}y_k = G_ky_k + Q_ky_k + P_ky_k Gk+1=Gk+Qk+PkGk+1yk=Gkyk+Qkyk+Pkyk
使Gk+1G_{k+1}Gk+1满足拟牛顿条件,可以使 Pkyk=δkQkyk=−Gkyk P_ky_k = \delta_k\\ Q_ky_k = - G_ky_k Pkyk=δkQkyk=−Gkyk
取 Pk=δkδkTδTykQ=−GkykykTGkykTGkyk P_k = \frac{\delta_k\delta_k^T}{\delta^Ty_k}\\ Q = - \frac{G_ky_ky_k^TG_k}{y_k^TG_ky_k} Pk=δTykδkδkTQ=−ykTGkykGkykykTGk
所以可以得到Gk+1G_{k+1}Gk+1的迭代公式 (2.2)Gk+1=Gk−GkykykTGkykTGkyk+δkδkTδTyk G_{k+1} = G_k - \frac{G_ky_ky_k^TG_k}{y_k^TG_ky_k} + \frac{\delta_k\delta_k^T}{\delta^Ty_k}\tag{2.2} Gk+1=Gk−ykTGkykGkykykTGk+δTykδkδkT(2.2)
算法过程:
输入:f(x)f(x)f(x),梯度g(x)=▽f(x)g(x) = \bigtriangledown f(x)g(x)=▽f(x),精度ϵ\epsilonϵ.
输出:f(x)f(x)f(x)的极小点x∗x^*x∗
BFGS是最流行得拟牛顿算法.
基本思想: 考虑用BkB_kBk逼近海赛矩阵HHH,则拟牛顿条件为:
Bk+1δk=ykB_{k+1}\delta_k = y_kBk+1δk=yk 令 Bk+1=Bk+Pk+QkBk+1δk=Bkδk+Pkδk+Qkδk B_{k+1} = B_k + P_k + Q_k\\ B_{k+1}\delta_k = B_k\delta_k + P_k\delta_k + Q_k\delta_k Bk+1=Bk+Pk+QkBk+1δk=Bkδk+Pkδk+Qkδk
使PkP_kPk和QkQ_kQk满足: Pkδk=ykQkδk=−Bkδk P_k\delta_k = y_k\\ Q_k\delta_k = -B_k\delta_k Pkδk=ykQkδk=−Bkδk 取 Pk=ykykTykTδkQk=−BkδkδkTBkδkTBkδk P_k = \frac{y_ky_k^T}{y_k^T\delta_k}\\ Q_k = - \frac{B_k\delta_k\delta_k^TB_k}{\delta_k^TB_k\delta_k} Pk=ykTδkykykTQk=−δkTBkδkBkδkδkTBk 我们得到了Bk+1B_{k+1}Bk+1的迭代公式: (2.3)Bk+1=Bk−BkδkδkTBkδkTBkδk+ykykTykTδk B_{k+1} = B_k - \frac{B_k\delta_k\delta_k^TB_k}{\delta_k^TB_k\delta_k} + \frac{y_ky_k^T}{y_k^T\delta_k}\tag{2.3} Bk+1=Bk−δkTBkδkBkδkδkTBk+ykTδkykykT(2.3) 可以证明,如果初始矩阵B0B_0B0是正定的,那么迭代过程中的每个矩阵BkB_kBk都是正定的.
算法过程:
输入:f(x)f(x)f(x),梯度g(x)=▽f(x)g(x) = \bigtriangledown f(x)g(x)=▽f(x),精度ϵ\epsilonϵ.
输出:f(x)f(x)f(x)的极小点x∗x^*x∗
可以从BFGS算法的BkB_kBk矩阵得到关于GkG_kGk的迭代公式. 若记Gk=Bk−1、Gk+1=Bk+1−1G_k = B_k^{-1}、G_{k+1} = B_{k+1}^{-1}Gk=Bk−1、Gk+1=Bk+1−1,对(2.3)(2.3)(2.3)用两次Sherman-Morrison公式可得
Gk+1=(I−δkykTδkTyk)Gk(I−δkykTδkTyk)T+δkykTδkTyk G_{k+1} = \Big(I - \frac{\delta_ky_k^T}{\delta^T_ky_k}\Big)G_k\Big(I - \frac{\delta_ky_k^T}{\delta^T_ky_k}\Big)^T + \frac{\delta_ky_k^T}{\delta^T_ky_k} Gk+1=(I−δkTykδkykT)Gk(I−δkTykδkykT)T+δkTykδkykT
李航-统计学习方法