Copyright notice: All documents are licensed under the Creative Commons License; all code is licensed under the MIT License. https://cloud.tencent.com/developer/article/1446395
This article derives the gradient of softmax + cross-entropy in backpropagation.
For the companion code, see the article:
Comparing Python and PyTorch Implementations of Multi-label softmax + cross-entropy Loss and Its Backpropagation
For a detailed introduction to softmax, see:
A Detailed Explanation of the softmax Function and the Derivation of Its Gradient in Backpropagation
For a detailed introduction to cross-entropy, see:
A Case-based Explanation of the cross-entropy Loss Function
Series index:
https://blog.csdn.net/oBrightLamp/article/details/85067981
In most tutorials, softmax and cross-entropy always appear together, and their gradients are derived together as well.
The gradients of softmax and of cross-entropy were each derived separately in the two articles above.
Consider an input vector x. Normalizing it with the softmax function yields the vector s, the predicted probability distribution. Given the vector y, the true probability distribution, the cross-entropy function produces the error value error (a scalar e). We want the gradient of e with respect to x.
$$
x = (x_1, x_2, x_3, \cdots, x_k)
$$
$$
s = \mathrm{softmax}(x), \qquad s_i = \frac{e^{x_i}}{\sum_{t=1}^{k} e^{x_t}}
$$
$$
e = \mathrm{crossEntropy}(s, y) = -\sum_{i=1}^{k} y_i \log(s_i)
$$
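As a concrete reference, the forward pass defined above fits in a few lines of NumPy. This is a minimal sketch, not the companion article's code; the names `softmax` and `cross_entropy` are illustrative:

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; the result is unchanged.
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

def cross_entropy(s, y):
    # e = -sum_i y_i * log(s_i)
    return -np.sum(y * np.log(s))

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0])  # true distribution
s = softmax(x)                 # predicted distribution
e = cross_entropy(s, y)        # scalar error
```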
Given:
$$
\nabla e_{(s)} = \frac{\partial e}{\partial s}
= \left(\frac{\partial e}{\partial s_1}, \frac{\partial e}{\partial s_2}, \cdots, \frac{\partial e}{\partial s_k}\right)
= \left(-\frac{y_1}{s_1}, -\frac{y_2}{s_2}, \cdots, -\frac{y_k}{s_k}\right)
$$
$$
\nabla s_{(x)} = \frac{\partial s}{\partial x}
= \begin{pmatrix}
\partial s_1/\partial x_1 & \partial s_1/\partial x_2 & \cdots & \partial s_1/\partial x_k \\
\partial s_2/\partial x_1 & \partial s_2/\partial x_2 & \cdots & \partial s_2/\partial x_k \\
\vdots & \vdots & \ddots & \vdots \\
\partial s_k/\partial x_1 & \partial s_k/\partial x_2 & \cdots & \partial s_k/\partial x_k
\end{pmatrix}
= \begin{pmatrix}
-s_1 s_1 + s_1 & -s_1 s_2 & \cdots & -s_1 s_k \\
-s_2 s_1 & -s_2 s_2 + s_2 & \cdots & -s_2 s_k \\
\vdots & \vdots & \ddots & \vdots \\
-s_k s_1 & -s_k s_2 & \cdots & -s_k s_k + s_k
\end{pmatrix}
$$
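Both known gradients translate directly into code. A sketch continuing the NumPy example above (the helper names are illustrative):

```python
def grad_e_wrt_s(s, y):
    # ∂e/∂s_i = -y_i / s_i
    return -y / s

def softmax_jacobian(s):
    # ∂s_i/∂x_j = s_i * (δ_ij - s_j), i.e. diag(s) - s s^T, which matches
    # the matrix above: diagonal -s_i s_i + s_i, off-diagonal -s_i s_j.
    return np.diag(s) - np.outer(s, s)
```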
By the chain rule:
$$
\frac{\partial e}{\partial x_i}
= \frac{\partial e}{\partial s_1}\frac{\partial s_1}{\partial x_i}
+ \frac{\partial e}{\partial s_2}\frac{\partial s_2}{\partial x_i}
+ \frac{\partial e}{\partial s_3}\frac{\partial s_3}{\partial x_i}
+ \cdots
+ \frac{\partial e}{\partial s_k}\frac{\partial s_k}{\partial x_i}
$$
Writing out $\partial e/\partial x_i$ for every $i$ gives the gradient vector of $e$ with respect to $x$:
$$
\nabla e_{(x)}
= \left(\frac{\partial e}{\partial s_1}, \frac{\partial e}{\partial s_2}, \frac{\partial e}{\partial s_3}, \cdots, \frac{\partial e}{\partial s_k}\right)
\begin{pmatrix}
\partial s_1/\partial x_1 & \partial s_1/\partial x_2 & \cdots & \partial s_1/\partial x_k \\
\partial s_2/\partial x_1 & \partial s_2/\partial x_2 & \cdots & \partial s_2/\partial x_k \\
\vdots & \vdots & \ddots & \vdots \\
\partial s_k/\partial x_1 & \partial s_k/\partial x_2 & \cdots & \partial s_k/\partial x_k
\end{pmatrix}
$$
$$
\nabla e_{(x)} = \nabla e_{(s)} \, \nabla s_{(x)}
$$
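Numerically, this is a single vector-matrix product. Continuing the sketch:

```python
grad_s = grad_e_wrt_s(s, y)   # ∇e(s), shape (k,)
jac_s = softmax_jacobian(s)   # ∇s(x), shape (k, k)
grad_x = grad_s @ jac_s       # ∇e(x) = ∇e(s) ∇s(x)
```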
Since:
$$
\nabla e_{(s)} = \left(-\frac{y_1}{s_1}, -\frac{y_2}{s_2}, \cdots, -\frac{y_k}{s_k}\right)
$$
$$
\nabla s_{(x)} = \begin{pmatrix}
-s_1 s_1 + s_1 & -s_1 s_2 & \cdots & -s_1 s_k \\
-s_2 s_1 & -s_2 s_2 + s_2 & \cdots & -s_2 s_k \\
\vdots & \vdots & \ddots & \vdots \\
-s_k s_1 & -s_k s_2 & \cdots & -s_k s_k + s_k
\end{pmatrix}
$$
we obtain:
$$
\nabla e_{(x)} = \left(s_1\sum_{t=1}^{k} y_t - y_1,\; s_2\sum_{t=1}^{k} y_t - y_2,\; \cdots,\; s_k\sum_{t=1}^{k} y_t - y_k\right)
$$
$$
\frac{\partial e}{\partial x_i} = s_i\sum_{t=1}^{k} y_t - y_i
$$
In particular, when $y$ is a valid probability distribution, $\sum_{t=1}^{k} y_t = 1$, and the expression reduces to $\partial e/\partial x_i = s_i - y_i$.
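The closed form can be checked against the explicit vector-Jacobian product computed above:

```python
grad_x_closed = s * np.sum(y) - y   # ∂e/∂x_i = s_i * Σ_t y_t - y_i
assert np.allclose(grad_x, grad_x_closed)
```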
Conclusion:
Treating softmax and cross-entropy together greatly reduces the computation needed to obtain the gradient: the simple closed form above replaces the full vector-Jacobian product, so the $k \times k$ Jacobian never has to be formed.
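As a final cross-check, assuming PyTorch is available (this mirrors the spirit of the companion article but is not its exact code), autograd reproduces the same gradient without forming the Jacobian explicitly:

```python
import torch

x_t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y_t = torch.tensor([0.0, 0.0, 1.0])

s_t = torch.softmax(x_t, dim=0)
e_t = -(y_t * torch.log(s_t)).sum()
e_t.backward()

# Since Σ y_t = 1 here, x_t.grad should equal s - y.
print(x_t.grad)
```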