
Logistic Regression

Author: 故事尾音 · Published 2019-12-18

The Logistic Regression Model

Model assumption: given the features, the label follows a Bernoulli distribution. The probability of the positive class is modeled as:

y=\sigma(f(\boldsymbol{x}))=\sigma\left(\boldsymbol{w}^{T} \boldsymbol{x}\right)=\frac{1}{1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}
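A minimal NumPy sketch of this prediction (the helper names `sigmoid` and `predict_proba` are my own, not taken from the code later in this post):

Code language: Python

import numpy as np

def sigmoid(z):
    # The logistic function 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # P(y = 1 | x) = sigmoid(w^T x) for each row of X.
    return sigmoid(X @ w)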

Loss Function

P(y | \boldsymbol{x})=\left\{\begin{array}{c}{p, y=1} \\ {1-p, y=0}\end{array}\right.

This is equivalent to:

P\left(y_{i} | \boldsymbol{x}_{i}\right)=p^{y_{i}}(1-p)^{1-y_{i}}

Suppose we have collected a dataset of N samples. The total probability of the data (the likelihood) is then (with p standing for the model output $\sigma(\boldsymbol{w}^{T}\boldsymbol{x}_{n})$ on each sample):

P =\prod_{n=1}^{N} p^{y_{n}}(1-p)^{1-y_{n}}

Maximizing this probability is the same as maximizing its logarithm, the log-likelihood L below; equivalently, minimizing the negative log loss -L:

\begin{aligned} L &=\sum_{n=1}^{N} \ln \left(p^{y_{n}}(1-p)^{1-y_{n}}\right) \\ &=\sum_{n=1}^{N}\left(y_{n} \ln (p)+\left(1-y_{n}\right) \ln (1-p)\right) \end{aligned}
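This loss is straightforward to compute directly. A minimal sketch, assuming 0/1 labels (the function name is illustrative):

Code language: Python

import numpy as np

def neg_log_likelihood(w, X, y):
    # -L from above; y is a 0/1 vector, X holds one sample per row.
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # p_n = sigmoid(w^T x_n)
    p = np.clip(p, 1e-12, 1 - 1e-12)     # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))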

Gradient Computation

In the loss above, p is given by:

p=\frac{1}{1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}

and 1 - p is:

1-p=\frac{e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}{1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}

The derivative of p with respect to $\boldsymbol{w}$ is:

p^{\prime}=p(1-p) \boldsymbol{x}

and the derivative of 1 - p is:

(1-p)^{\prime}=-p(1-p) \boldsymbol{x}
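As a quick sanity check on these two derivative formulas, a finite-difference comparison with made-up values (purely illustrative, not from the original post):

Code language: Python

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w, x, eps = np.array([0.3, -0.7]), np.array([1.0, 2.0]), 1e-6

# dp/dw_k should equal p * (1 - p) * x_k.
p = sigmoid(w @ x)
analytic = p * (1 - p) * x
numeric = np.array([(sigmoid(w @ x + eps * x[k]) - p) / eps
                    for k in range(2)])
print(analytic, numeric)  # the two vectors agree to about 1e-6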

Therefore the gradient of the log-likelihood is:

\begin{aligned} \nabla L(\boldsymbol{w}) &=\nabla\left(\sum_{n=1}^{N}\left(y_{n} \ln (p)+\left(1-y_{n}\right) \ln (1-p)\right)\right) \\ &=\sum_{n=1}^{N}\left(y_{n} \frac{1}{p} p^{\prime}+\left(1-y_{n}\right) \frac{1}{1-p}(1-p)^{\prime}\right) \\ &=\sum_{n=1}^{N}\left(y_{n}(1-p) \boldsymbol{x}_{n}-\left(1-y_{n}\right) p \boldsymbol{x}_{n}\right) \\ &=\sum_{n=1}^{N}\left(y_{n}-p\right) \boldsymbol{x}_{n} \end{aligned}
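In vectorized form this gradient is a single matrix product. A minimal sketch (again assuming 0/1 labels; the names are mine):

Code language: Python

import numpy as np

def log_lik_gradient(w, X, y):
    # sum_n (y_n - p_n) x_n, computed for all samples at once.
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y - p)

# One gradient-ascent step on the log-likelihood (equivalently, one
# gradient-descent step on the negative log loss):
# w = w + learning_rate * log_lik_gradient(w, X, y)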

The Decision Boundary of Logistic Regression

The decision boundary of logistic regression is the set of points where the predicted probability is 0.5:

\frac{1}{1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}=0.5

Simplifying this equation gives:

e^{-\boldsymbol{w}^{T} \boldsymbol{x}}=1=e^{0}

which implies:

-\boldsymbol{w}^{T} \boldsymbol{x}=0

so the decision boundary is the hyperplane $\boldsymbol{w}^{T} \boldsymbol{x}=0$, which is linear in $\boldsymbol{x}$.
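To see this concretely, a tiny two-feature example (values chosen only for illustration):

Code language: Python

import numpy as np

w = np.array([1.0, -2.0])
for x in (np.array([2.0, 1.0]),    # w^T x = 0: exactly on the boundary
          np.array([3.0, 1.0]),    # w^T x = 1 > 0: predicted class 1
          np.array([1.0, 1.0])):   # w^T x = -1 < 0: predicted class 0
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    print(w @ x, p)                # p crosses 0.5 exactly where w^T x = 0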

Code

Logistic regression with L2-norm regularization:

Code language: Python
import numpy as np
from scipy.optimize import fmin_bfgs


def sigmoid(z):
    """ Logistic function. """
    return 1.0 / (1.0 + np.exp(-z))


class SyntheticClassifierData():
    """ Minimal synthetic data generator (assumed here; the original post
    does not include it): Gaussian features, labels in {-1, +1} from a
    noisy random linear rule, split into train and test halves. """

    def __init__(self, N, d):
        w = np.random.randn(d)
        X = np.random.randn(2 * N, d)
        Y = np.sign(X.dot(w) + 0.5 * np.random.randn(2 * N))
        self.X_train, self.Y_train = X[:N], Y[:N]
        self.X_test, self.Y_test = X[N:], Y[N:]


class LogisticRegression():
    """ A simple logistic regression model with L2 regularization (zero-mean
    Gaussian priors on parameters). Labels are assumed to be in {-1, +1}. """

    def __init__(self, x_train=None, y_train=None, x_test=None, y_test=None,
                 alpha=.1, synthetic=False):
        # Set L2 regularization strength
        self.alpha = alpha
        # Set the data.
        self.set_data(x_train, y_train, x_test, y_test)
        # Initialize parameters to zero, for lack of a better choice.
        self.betas = np.zeros(self.x_train.shape[1])

    def negative_lik(self, betas):
        return -1 * self.lik(betas)

    def lik(self, betas):
        """ Likelihood of the data under the current settings of parameters. """
        # Data likelihood
        l = 0
        for i in range(self.n):
            l += np.log(sigmoid(self.y_train[i] *
                                np.dot(betas, self.x_train[i, :])))
        # Prior likelihood (note: penalize the betas being evaluated, not
        # self.betas, so the optimizer sees the correct objective)
        for k in range(1, self.x_train.shape[1]):
            l -= (self.alpha / 2.0) * betas[k]**2
        return l

    def train(self):
        """ Define the gradient and hand it off to a scipy gradient-based
        optimizer. """
        # Define the derivative of the negative likelihood with respect to
        # beta_k. Need to multiply by -1 because we will be minimizing.
        dB_k = lambda B, k: (k > 0) * self.alpha * B[k] - np.sum([
            self.y_train[i] * self.x_train[i, k] *
            sigmoid(-self.y_train[i] * np.dot(B, self.x_train[i, :]))
            for i in range(self.n)])

        # The full gradient is just an array of componentwise derivatives
        dB = lambda B: np.array([dB_k(B, k)
                                 for k in range(self.x_train.shape[1])])
        # Optimize
        self.betas = fmin_bfgs(self.negative_lik, self.betas, fprime=dB)

    def set_data(self, x_train, y_train, x_test, y_test):
        """ Take data that's already been generated. """
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test
        self.n = y_train.shape[0]

    def training_reconstruction(self):
        p_y1 = np.zeros(self.n)
        for i in range(self.n):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_train[i, :]))
        return p_y1

    def test_predictions(self):
        # Use the test-set size here, not self.n (the train-set size).
        n_test = self.x_test.shape[0]
        p_y1 = np.zeros(n_test)
        for i in range(n_test):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_test[i, :]))
        return p_y1

    def plot_training_reconstruction(self):
        plot(np.arange(self.n), .5 + .5 * self.y_train, 'bo')
        plot(np.arange(self.n), self.training_reconstruction(), 'rx')
        ylim([-.1, 1.1])

    def plot_test_predictions(self):
        n_test = self.x_test.shape[0]
        plot(np.arange(n_test), .5 + .5 * self.y_test, 'yo')
        plot(np.arange(n_test), self.test_predictions(), 'rx')
        ylim([-.1, 1.1])


if __name__ == "__main__":
    from pylab import *
    # Create 20 dimensional data set with 25 points -- this will be
    # susceptible to overfitting.
    data = SyntheticClassifierData(25, 20)

    # Run for a variety of regularization strengths
    alphas = [0, .001, .01, .1]
    for j, a in enumerate(alphas):
        # Create a new learner, but use the same data for each run
        lr = LogisticRegression(x_train=data.X_train, y_train=data.Y_train,
                                x_test=data.X_test, y_test=data.Y_test, alpha=a)

        print("Initial likelihood:")
        print(lr.lik(lr.betas))

        # Train the model
        lr.train()

        # Display execution info
        print("Final betas:")
        print(lr.betas)
        print("Final lik:")
        print(lr.lik(lr.betas))

        # Plot the results
        subplot(len(alphas), 2, 2 * j + 1)
        lr.plot_training_reconstruction()
        ylabel("Alpha=%s" % a)
        if j == 0:
            title("Training set reconstructions")

        subplot(len(alphas), 2, 2 * j + 2)
        lr.plot_test_predictions()
        if j == 0:
            title("Test set predictions")
    show()

Why Logistic Regression Uses Log Loss Rather Than Squared Loss

For logistic regression, the log loss described here and maximum likelihood estimation are the same thing. The reason squared loss is not used: if the sigmoid output is taken as the probability of the positive class and squared error is used as the loss, the resulting objective is non-convex in the parameters, which makes it hard to optimize; gradient-based solvers easily get stuck in local optima. With maximum likelihood, on the other hand, the loss is the negative log-likelihood, which is convex in the unknown parameters and continuously differentiable to all orders, so its global optimum is easy to find.
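A small numerical sketch of this difference, for a single training example with x = 1, y = 1, treating the loss as a function of a scalar weight w (illustrative, not from the original post):

Code language: Python

import numpy as np

w = np.linspace(-10, 10, 201)
p = 1.0 / (1.0 + np.exp(-w))     # sigmoid(w * x) with x = 1
squared = (p - 1.0) ** 2         # squared loss: plateaus for w << 0
logloss = -np.log(p)             # log loss: convex in w

# Approximate d(loss)/dw on the grid: far from the optimum the squared
# loss's gradient is nearly zero, while the log loss's stays close to -1,
# so gradient descent keeps making progress.
print(np.gradient(squared, w)[:3])
print(np.gradient(logloss, w)[:3])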

Softmax

For a K-class problem, softmax regression generalizes logistic regression. Its hypothesis outputs a probability for every class:

h_{\theta}(x)=\left[\begin{array}{c}{P(y=1 | x ; \theta)} \\ {P(y=2 | x ; \theta)} \\ {\vdots} \\ {P(y=K | x ; \theta)}\end{array}\right]=\frac{1}{\sum_{j=1}^{K} \exp \left(\theta_{j}^{T} x\right)}\left[\begin{array}{c}{\exp \left(\theta_{1}^{T} x\right)} \\ {\exp \left(\theta_{2}^{T} x\right)} \\ {\vdots} \\ {\exp \left(\theta_{K}^{T} x\right)}\end{array}\right]

The cost function is:

J(\theta)=-\left[\sum_{i=1}^{n} \sum_{k=1}^{K} \mathbf{1}\left\{y^{(i)}=k\right\} \ln \frac{\exp \left(\theta_{k}^{T} x^{(i)}\right)}{\sum_{j=1}^{K} \exp \left(\theta_{j}^{T} x^{(i)}\right)}\right]

where $\mathbf{1}\{\cdot\}$ is the indicator function: it takes the value 1 when the condition inside the braces is true and 0 otherwise.
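A minimal NumPy sketch of this hypothesis and cost (the function names are mine; labels are assumed to be integers 0..K-1):

Code language: Python

import numpy as np

def softmax(scores):
    # Subtract the per-row max before exponentiating, for numerical stability.
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_cost(Theta, X, y):
    # J(theta) above. Theta is (K, d), X is (n, d), y holds labels 0..K-1.
    # The indicator 1{y_i = k} simply picks out each sample's true-class
    # log-probability, hence the fancy indexing below.
    P = softmax(X @ Theta.T)
    return -np.sum(np.log(P[np.arange(X.shape[0]), y]))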
