Logistic regression is a classification algorithm for binary classification problems. Suppose there are $m$ training samples $\left \{ \left ( \mathbf{x}^{(1)},y^{(1)} \right ),\left ( \mathbf{x}^{(2)},y^{(2)} \right ),\cdots ,\left ( \mathbf{x}^{(m)},y^{(m)} \right ) \right \}$. For logistic regression, the input features are $\mathbf{x}^{(i)}\in \Re ^{n+1}$ and the class labels are $y^{(i)}\in \left \{ 0,1 \right \}$. The hypothesis is the sigmoid function:
$$h_\theta \left ( \mathbf{x} \right )=\frac{1}{1+e^{-\theta ^T\mathbf{x}}}$$
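As a quick illustration, the sigmoid hypothesis can be sketched in NumPy (the function names here are illustrative, not from the text; $\mathbf{x}$ is assumed to already include the intercept component):

```python
import numpy as np

def sigmoid(z):
    # h_theta(x) = 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # theta, x in R^{n+1}: the intercept is folded into the feature vector
    return sigmoid(x @ theta)
```

For example, `h(np.zeros(2), np.array([1.0, 2.0]))` returns 0.5, since $\theta^T\mathbf{x}=0$.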
where the model parameter $\theta$ is learned by minimizing the loss function:
$$J\left ( \theta \right )=-\frac{1}{m}\sum_{i=1}^{m}\left [ y^{(i)}\log h_\theta \left ( \mathbf{x}^{(i)} \right )+\left ( 1-y^{(i)} \right )\log\left ( 1-h_\theta \left ( \mathbf{x}^{(i)} \right ) \right ) \right ]$$
This can be solved by gradient descent; the gradient is:
$$\begin{aligned}
\triangledown _{\theta _j}J\left ( \theta \right ) &= -\frac{1}{m}\sum_{i=1}^{m}\left [ \frac{y^{(i)}}{h_\theta \left ( \mathbf{x}^{(i)} \right )}\cdot \triangledown _{\theta _j}h_\theta \left ( \mathbf{x}^{(i)} \right )+\frac{1-y^{(i)}}{1-h_\theta \left ( \mathbf{x}^{(i)} \right )}\cdot \triangledown _{\theta _j}\left ( 1-h_\theta \left ( \mathbf{x}^{(i)} \right ) \right ) \right ] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left [ \frac{y^{(i)}}{h_\theta \left ( \mathbf{x}^{(i)} \right )}\cdot \triangledown _{\theta _j}h_\theta \left ( \mathbf{x}^{(i)} \right )-\frac{1-y^{(i)}}{1-h_\theta \left ( \mathbf{x}^{(i)} \right )}\cdot \triangledown _{\theta _j}h_\theta \left ( \mathbf{x}^{(i)} \right ) \right ] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left [ \left ( \frac{y^{(i)}}{h_\theta \left ( \mathbf{x}^{(i)} \right )}-\frac{1-y^{(i)}}{1-h_\theta \left ( \mathbf{x}^{(i)} \right )} \right )\cdot \triangledown _{\theta _j}h_\theta \left ( \mathbf{x}^{(i)} \right ) \right ] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left [ \frac{y^{(i)}-h_\theta \left ( \mathbf{x}^{(i)} \right )}{h_\theta \left ( \mathbf{x}^{(i)} \right )\left ( 1-h_\theta \left ( \mathbf{x}^{(i)} \right ) \right )}\cdot \triangledown _{\theta _j}h_\theta \left ( \mathbf{x}^{(i)} \right ) \right ] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left [ \frac{y^{(i)}-h_\theta \left ( \mathbf{x}^{(i)} \right )}{h_\theta \left ( \mathbf{x}^{(i)} \right )\left ( 1-h_\theta \left ( \mathbf{x}^{(i)} \right ) \right )}\cdot \triangledown _{\theta ^T\mathbf{x}^{(i)}}h_\theta \left ( \mathbf{x}^{(i)} \right )\cdot \triangledown _{\theta _j}\left ( \theta ^T\mathbf{x}^{(i)} \right ) \right ]
\end{aligned}$$
Noting that:

$$\triangledown _{\theta ^T\mathbf{x}^{(i)}}h_\theta \left ( \mathbf{x}^{(i)} \right )=h_\theta \left ( \mathbf{x}^{(i)} \right )\left ( 1-h_\theta \left ( \mathbf{x}^{(i)} \right ) \right )$$

and

$$\triangledown _{\theta _j}\left ( \theta ^T\mathbf{x}^{(i)} \right )=x^{(i)}_j$$
the gradient therefore simplifies to:
$$\triangledown _{\theta _j}J\left ( \theta \right )=-\frac{1}{m}\sum_{i=1}^{m}\left [ \left ( y^{(i)}-h_\theta \left ( \mathbf{x}^{(i)} \right ) \right )\cdot x^{(i)}_j \right ]$$
Gradient descent then gives the following update rule:
$$\theta _j:=\theta _j-\alpha \triangledown _{\theta _j}J\left ( \theta \right )$$
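Putting the gradient and the update rule together, a minimal batch-gradient-descent sketch for logistic regression (NumPy assumed; the learning rate, iteration count, and function names are illustrative choices, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient(theta, X, y):
    # grad_j J(theta) = -(1/m) * sum_i (y_i - h_theta(x_i)) * x_ij
    m = X.shape[0]
    return -(X.T @ (y - sigmoid(X @ theta))) / m

def fit_logistic(X, y, alpha=0.1, iters=5000):
    # theta_j := theta_j - alpha * grad_j J(theta)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = theta - alpha * logistic_gradient(theta, X, y)
    return theta
```

On a tiny linearly separable data set (first column is the intercept), the learned parameters separate the two classes.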
Softmax regression generalizes logistic regression to multi-class problems, i.e. the class label $y$ can take more than two values. Suppose there are $m$ training samples $\left \{ \left ( \mathbf{x}^{(1)},y^{(1)} \right ),\left ( \mathbf{x}^{(2)},y^{(2)} \right ),\cdots ,\left ( \mathbf{x}^{(m)},y^{(m)} \right ) \right \}$. For softmax regression, the input features are $\mathbf{x}^{(i)}\in \Re ^{n+1}$ and the class labels are $y^{(i)}\in \left \{ 1,2,\cdots ,k \right \}$. The hypothesis estimates, for each sample, the probability $p\left ( y=j\mid \mathbf{x} \right )$ of belonging to each class; concretely:
$$h_\theta \left ( \mathbf{x}^{(i)} \right )=\begin{bmatrix} p\left ( y^{(i)}=1\mid \mathbf{x}^{(i)};\theta \right )\\ p\left ( y^{(i)}=2\mid \mathbf{x}^{(i)};\theta \right )\\ \vdots \\ p\left ( y^{(i)}=k\mid \mathbf{x}^{(i)};\theta \right ) \end{bmatrix}=\frac{1}{\sum_{j=1}^{k}e^{\theta ^T_j\mathbf{x}^{(i)}}}\begin{bmatrix} e^{\theta ^T_1\mathbf{x}^{(i)}}\\ e^{\theta ^T_2\mathbf{x}^{(i)}}\\ \vdots \\ e^{\theta ^T_k\mathbf{x}^{(i)}} \end{bmatrix}$$
where $\theta _1,\theta _2,\cdots ,\theta _k$ are parameter vectors, with $\theta _i\in \Re ^{n+1}$. The estimated probability that a sample belongs to class $j$ is then:
$$p\left ( y^{(i)}=j\mid \mathbf{x}^{(i)};\theta \right )=\frac{e^{\theta ^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}}$$
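A sketch of these class probabilities in NumPy (rows of `Theta` are the vectors $\theta _1,\dots ,\theta _k$; subtracting the row maximum before exponentiating is a standard numerical-stability trick, not something discussed in the text):

```python
import numpy as np

def softmax_probs(Theta, X):
    # Theta: (k, n+1) with theta_j as rows; X: (m, n+1) samples as rows.
    # p(y=j | x) = exp(theta_j^T x) / sum_l exp(theta_l^T x)
    scores = X @ Theta.T                                 # (m, k) of theta_j^T x
    scores = scores - scores.max(axis=1, keepdims=True)  # stability shift
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)
```

Each row of the result sums to 1; with all-zero parameters the distribution is uniform over the $k$ classes.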
As in logistic regression, the softmax cost function uses an indicator function $I\left \{ \cdot \right \}$, defined as:
$$I\left \{ expression \right \}=\begin{cases} 0 & \text{if } expression=\text{false} \\ 1 & \text{if } expression=\text{true} \end{cases}$$
The cost function for softmax regression is then:
$$J\left ( \theta \right )=-\frac{1}{m}\left [ \sum_{i=1}^{m}\sum_{j=1}^{k}I\left \{ y^{(i)}=j \right \}\log\frac{e^{\theta ^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}} \right ]$$
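The cost can be sketched as below (labels are assumed 0-based here, $y^{(i)}\in \{0,\dots ,k-1\}$, which is the usual array-indexing convention and differs from the 1-based indices in the formula):

```python
import numpy as np

def softmax_cost(Theta, X, y):
    # J(theta) = -(1/m) sum_i sum_j I{y_i = j} log p(y_i = j | x_i)
    m = X.shape[0]
    scores = X @ Theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    P = e / e.sum(axis=1, keepdims=True)
    # the indicator I{y_i = j} selects the probability of the true class
    return -np.log(P[np.arange(m), y]).mean()
```

With all-zero parameters every class has probability $1/k$, so the cost equals $\log k$.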
This cost function can also be minimized by gradient descent. First take the gradient:
$$\triangledown _{\theta _j}J\left ( \theta \right )=-\frac{1}{m}\sum_{i=1}^{m}\left [ \triangledown _{\theta _j}\sum_{{j}'=1}^{k}I\left \{ y^{(i)}={j}' \right \}\log\frac{e^{\theta ^T_{{j}'}\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}} \right ]$$
Since each sample belongs to exactly one class, only one term of the inner sum survives. If the true class of sample $i$ is $j$ (the same index as the gradient), then:
$$\begin{aligned}
\triangledown _{\theta _j}J\left ( \theta \right ) &= -\frac{1}{m}\sum_{i=1}^{m}\left [ \triangledown _{\theta _j}\log\frac{e^{\theta ^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}} \right ] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left [ \frac{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}}{e^{\theta ^T_j\mathbf{x}^{(i)}}}\cdot \frac{e^{\theta ^T_j\mathbf{x}^{(i)}}\cdot \mathbf{x}^{(i)}\cdot \sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}-e^{\theta ^T_j\mathbf{x}^{(i)}}\cdot \mathbf{x}^{(i)}\cdot e^{\theta ^T_j\mathbf{x}^{(i)}}}{\left ( \sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}} \right )^2} \right ] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left [ \frac{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}-e^{\theta ^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}}\cdot \mathbf{x}^{(i)} \right ]
\end{aligned}$$
If instead the true class of sample $i$ is some ${j}'\neq j$, then:

$$\begin{aligned}
\triangledown _{\theta _j}J\left ( \theta \right ) &= -\frac{1}{m}\sum_{i=1}^{m}\left [ \triangledown _{\theta _j}\log\frac{e^{\theta ^T_{{j}'}\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}} \right ] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left [ \frac{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}}{e^{\theta ^T_{{j}'}\mathbf{x}^{(i)}}}\cdot \frac{-e^{\theta ^T_{{j}'}\mathbf{x}^{(i)}}\cdot \mathbf{x}^{(i)}\cdot e^{\theta ^T_j\mathbf{x}^{(i)}}}{\left ( \sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}} \right )^2} \right ] \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left [ -\frac{e^{\theta ^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}}\cdot \mathbf{x}^{(i)} \right ]
\end{aligned}$$
Combining the two cases with the indicator function, the final result is:
$$\triangledown _{\theta _j}J\left ( \theta \right )=-\frac{1}{m}\sum_{i=1}^{m}\left [ \mathbf{x}^{(i)}\left ( I\left \{ y^{(i)}=j \right \}-p\left ( y^{(i)}=j\mid \mathbf{x}^{(i)};\theta \right ) \right ) \right ]$$
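Vectorized over all $k$ parameter vectors at once, the gradient can be sketched as follows (a one-hot matrix plays the role of the indicator $I\left \{ y^{(i)}=j \right \}$; 0-based labels assumed):

```python
import numpy as np

def softmax_gradient(Theta, X, y):
    # grad_{theta_j} J = -(1/m) sum_i x_i * (I{y_i = j} - p(y_i = j | x_i))
    m, k = X.shape[0], Theta.shape[0]
    scores = X @ Theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    P = e / e.sum(axis=1, keepdims=True)  # (m, k) probabilities
    Y = np.zeros((m, k))
    Y[np.arange(m), y] = 1.0              # one-hot indicator matrix
    return -((Y - P).T @ X) / m           # (k, n+1): row j is the gradient wrt theta_j
```

Because both the indicator rows and the probability rows sum to 1, the per-class gradients summed over $j$ cancel to zero, which gives a quick sanity check.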
Note that $\theta _j$ here is a vector. Gradient descent updates the parameters by:
$$\theta _j:=\theta _j-\alpha \triangledown _{\theta _j}J\left ( \theta \right )$$
Softmax regression has a parameter-redundancy problem: put simply, some of the parameters carry no information. To see this, subtract a vector $\psi$ from every parameter vector $\theta _j$; the hypothesis becomes:
$$p\left ( y^{(i)}=j\mid \mathbf{x}^{(i)};\theta \right )=\frac{e^{(\theta _j-\psi )^T\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{(\theta _l-\psi )^T\mathbf{x}^{(i)}}}=\frac{e^{\theta ^T_j\mathbf{x}^{(i)}}\cdot e^{-\psi ^T\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}\cdot e^{-\psi ^T\mathbf{x}^{(i)}}}=\frac{e^{\theta ^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}}$$
Subtracting $\psi$ from every $\theta _j$ therefore has no effect on the predictions; in other words, the model has multiple sets of optimal parameters.
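This invariance is easy to verify numerically; a sketch (random parameters and data, with $\psi$ subtracted from every row of `Theta`):

```python
import numpy as np

def softmax_probs(Theta, X):
    e = np.exp(X @ Theta.T)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 4))   # k = 3 classes, n + 1 = 4 features
X = rng.normal(size=(5, 4))
psi = rng.normal(size=4)

P1 = softmax_probs(Theta, X)
P2 = softmax_probs(Theta - psi, X)  # subtract psi from every theta_j
```

`P1` and `P2` agree to machine precision, confirming that the shifted parameters define the same model.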
To keep the algorithm as simple as possible, all parameters are retained, and a weight-decay term, i.e. regularization of the parameters, is added to the cost function to resolve the redundancy. For example, with an L2 constraint on the parameters, the regularization term is:
$$\frac{\lambda }{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta ^2_{ij}$$
and the cost function becomes:
$$J\left ( \theta \right )=-\frac{1}{m}\left [ \sum_{i=1}^{m}\sum_{j=1}^{k}I\left \{ y^{(i)}=j \right \}\log\frac{e^{\theta ^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta ^T_l\mathbf{x}^{(i)}}} \right ]+\frac{\lambda }{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta ^2_{ij}$$
With $\lambda >0$, this cost function is strictly convex. Its gradient is:
$$\triangledown _{\theta _j}J\left ( \theta \right )=-\frac{1}{m}\sum_{i=1}^{m}\left [ \mathbf{x}^{(i)}\left ( I\left \{ y^{(i)}=j \right \}-p\left ( y^{(i)}=j\mid \mathbf{x}^{(i)};\theta \right ) \right ) \right ]+\lambda \theta _j$$
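With weight decay, the code changes only by the $\frac{\lambda }{2}\sum \theta ^2$ term in the cost and the $\lambda \theta _j$ term in the gradient; a sketch (0-based labels assumed), which can be checked against finite differences:

```python
import numpy as np

def _probs(Theta, X):
    scores = X @ Theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def softmax_cost_reg(Theta, X, y, lam):
    # data term plus (lambda/2) * sum of squared parameters
    m = X.shape[0]
    P = _probs(Theta, X)
    return -np.log(P[np.arange(m), y]).mean() + 0.5 * lam * np.sum(Theta ** 2)

def softmax_gradient_reg(Theta, X, y, lam):
    # grad = -(1/m) sum_i x_i (I{y_i=j} - p(y_i=j|x_i)) + lambda * theta_j
    m, k = X.shape[0], Theta.shape[0]
    P = _probs(Theta, X)
    Y = np.zeros((m, k))
    Y[np.arange(m), y] = 1.0
    return -((Y - P).T @ X) / m + lam * Theta
```

A central-difference check on any single component of `Theta` confirms the analytic gradient.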
Logistic regression is the special case of softmax regression with $k=2$. When $k=2$, the softmax hypothesis is:
$$h_\theta \left ( \mathbf{x} \right )=\frac{1}{e^{\theta _1^T\mathbf{x}}+e^{\theta _2^T\mathbf{x}}}\begin{bmatrix} e^{\theta _1^T\mathbf{x}}\\ e^{\theta _2^T\mathbf{x}} \end{bmatrix}$$
Using the parameter redundancy of softmax regression, set $\psi =\theta _1$ and subtract it from both parameter vectors, which gives:
$$\begin{aligned}
h_\theta \left ( \mathbf{x} \right ) &= \frac{1}{e^{(\theta _1-\psi )^T\mathbf{x}}+e^{(\theta _2-\psi )^T\mathbf{x}}}\begin{bmatrix} e^{(\theta _1-\psi )^T\mathbf{x}}\\ e^{(\theta _2-\psi )^T\mathbf{x}} \end{bmatrix} \\
&= \begin{bmatrix} \frac{1}{1+e^{(\theta _2-\theta _1 )^T\mathbf{x}}}\\ \frac{e^{(\theta _2-\theta _1 )^T\mathbf{x}}}{1+e^{(\theta _2-\theta _1 )^T\mathbf{x}}} \end{bmatrix} \\
&= \begin{bmatrix} \frac{1}{1+e^{(\theta _2-\theta _1 )^T\mathbf{x}}}\\ 1-\frac{1}{1+e^{(\theta _2-\theta _1 )^T\mathbf{x}}} \end{bmatrix}
\end{aligned}$$
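This reduction is easy to check numerically: for $k=2$, the first softmax component equals the logistic sigmoid applied to $(\theta _1-\theta _2)^T\mathbf{x}$ (a sketch with arbitrary example values):

```python
import numpy as np

theta1 = np.array([0.5, -1.0])
theta2 = np.array([-0.3, 0.7])
x = np.array([1.0, 2.0])

scores = np.array([theta1 @ x, theta2 @ x])
p = np.exp(scores) / np.exp(scores).sum()           # k = 2 softmax
sig = 1.0 / (1.0 + np.exp(-(theta1 - theta2) @ x))  # logistic with theta = theta1 - theta2
```

`p[0]` equals `sig` and `p[1]` equals `1 - sig`, matching the matrix identity above.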
This expression is identical in form to logistic regression.
One might think that a multi-class problem could instead be solved by combining multiple binary classifiers. Whether to use a single multi-class classifier directly or to stack several binary classifiers depends, as the author explains in UFLDL, on whether the classes are mutually exclusive.
For more on softmax regression, including experiments, see the blog post 简单易学的机器学习算法——Softmax Regression.