Linear Regression and the Method of Least Squares | Machine Learning Notes

用户1332428

Published 2018-03-08 13:40:33

Included in the column: 人工智能LeadAI

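This note covers the basic concepts of linear regression and the method of least squares.

In machine learning, an important and common problem is learning and predicting the functional relationship between the feature variables (independent variables) and the corresponding response variable (dependent variable).

We focus here on linear functions: learning a linear relationship between the features and the response.

This is an introductory article on the basic concepts. It walks through the basic steps of building such a model, and it assumes some background in Python and mathematics.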

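Linear modeling

Take the winning times of the Olympic men's 100 m final as an example. We want to predict the gold-medal time for a given Olympic year. The data were collected from the web and arranged into a suitably formatted dataset.

The file is parsed with Python.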
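A minimal sketch of the parsing step, assuming a space-separated text file with the winning time in the first column and the year in the second (the layout suggested by the scikit-learn script at the end of this note). The helper name `load_olympic_data` and the two sample rows are made up for illustration and are not real Olympic results.

```python
import io

import numpy as np


def load_olympic_data(source):
    """Read 'time year' rows and return (year, time) as 1-D float arrays.

    The column order (time first, year second) is an assumption.
    `source` may be a file name or any file-like object.
    """
    data = np.loadtxt(source, delimiter=" ", ndmin=2)
    time = data[:, 0]
    year = data[:, 1]
    return year, time


# Placeholder rows that only demonstrate the file format.
sample = io.StringIO("10.5 2000\n10.0 2004\n")
year, time = load_olympic_data(sample)
print(year, time)
```

Passing a file name such as `"data1.txt"` instead of the `StringIO` object would read the same layout from disk.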

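Plotting the data with Python gives a rough picture of how they are distributed.

The points follow an approximately linear trend, so we use linear modeling to build a model for this data.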

1. Define the model

$$t = f(x; a)$$

where $t$ is the running time, $x$ is the year, and $a$ is a necessary but unknown parameter.

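2. Model assumption

Assume the relationship between $x$ and $t$ is linear:

$$t = f(x; w_0, w_1) = w_0 + w_1 x$$

This is the standard form of a straight line. The learning task is now to use the data to choose suitable values for the two parameters $w_0$ and $w_1$.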

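3. Define what the best model is (key point)

First, the best model is the straight line that lies as close as possible to all the data points.

Let $(x_n, t_n)$ denote the year and time of the $n$-th observation in the real dataset. The smaller the gap between the prediction $f(x_n; w_0, w_1)$ and the observed $t_n$, the closer the model is to that data point.

This leads to the squared loss function for the $n$-th example:

$$L_n\left(t_n, f(x_n; w_0, w_1)\right) = \left(t_n - f(x_n; w_0, w_1)\right)^2$$

Likewise, the average squared loss over the whole dataset of $N$ observations is:

$$L = \frac{1}{N}\sum_{n=1}^{N} L_n\left(t_n, f(x_n; w_0, w_1)\right)$$

Our task is to adjust the model parameters so that the average squared loss is as small as possible:

$$\widehat{w_0}, \widehat{w_1} = \operatorname*{argmin}_{w_0, w_1} \frac{1}{N}\sum_{n=1}^{N} L_n\left(t_n, f(x_n; w_0, w_1)\right)$$

(Here argmin is mathematical shorthand for "the argument that minimizes". Minimizing the squared loss gives the best values of the two parameters.)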
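The average squared loss of a straight line can be sketched in a few lines of NumPy. The function name `average_squared_loss` and the toy data are assumptions made for illustration; the data lie exactly on the line t = 1 + 2x so the effect of a wrong parameter is easy to see.

```python
import numpy as np


def average_squared_loss(w0, w1, x, t):
    """Average squared loss of the line t = w0 + w1 * x on the data (x, t)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    residuals = t - (w0 + w1 * x)  # observed value minus prediction
    return np.mean(residuals ** 2)


# Toy data lying exactly on t = 1 + 2x (made up for illustration).
x = [0.0, 1.0, 2.0, 3.0]
t = [1.0, 3.0, 5.0, 7.0]

loss_exact = average_squared_loss(1.0, 2.0, x, t)  # parameters of the true line
loss_off = average_squared_loss(0.0, 2.0, x, t)    # intercept wrong by 1
print(loss_exact, loss_off)
```

The line that generated the data has zero loss, and any other choice of parameters gives a larger value, which is what makes the loss usable as a measure of fit.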

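4. The least squares solution

As stated above, we use the average squared loss to find the values of $w_0$ and $w_1$ that bring the fitted function as close as possible to the dataset. So we minimize the loss function:

$$L = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - w_0 - w_1 x_n\right)^2$$

What we need is the minimum of this function with $w_0$ and $w_1$ as the variables and every other quantity treated as a constant.

By the theory of extrema for functions of several variables, a minimum requires

$$\frac{\partial L}{\partial w_0} = 0, \qquad \frac{\partial L}{\partial w_1} = 0$$

so we take the partial derivative with respect to $w_0$ and $w_1$ in turn.

When differentiating with respect to $w_1$, we keep only the part of the expression that involves $w_1$: terms without it differentiate to 0, so we drop them. Also, any factor inside a sum that does not depend on the summation index can be pulled outside the sum (it is treated as a constant):

$$\frac{1}{N}\sum_{n=1}^{N}\left(w_1^2 x_n^2 + 2 w_1 x_n (w_0 - t_n)\right) = w_1^2\,\frac{1}{N}\sum_{n=1}^{N} x_n^2 + 2 w_1\,\frac{1}{N}\sum_{n=1}^{N} x_n (w_0 - t_n)$$

The same applies to $w_0$.

The resulting partial derivatives are:

$$\frac{\partial L}{\partial w_1} = 2 w_1\,\frac{1}{N}\sum_{n=1}^{N} x_n^2 + \frac{2}{N}\sum_{n=1}^{N} x_n (w_0 - t_n)$$

$$\frac{\partial L}{\partial w_0} = 2 w_0 + 2 w_1\,\frac{1}{N}\sum_{n=1}^{N} x_n - \frac{2}{N}\sum_{n=1}^{N} t_n$$

Setting the derivative with respect to $w_0$ to 0 and solving gives

$$\widehat{w_0} = \bar{t} - w_1 \bar{x}$$

where $\bar{x} = \frac{1}{N}\sum_n x_n$ and $\bar{t} = \frac{1}{N}\sum_n t_n$. Substituting this into $\partial L / \partial w_1$ and setting it to 0 gives

$$\widehat{w_1} = \frac{\overline{xt} - \bar{x}\,\bar{t}}{\overline{x^2} - \bar{x}^2}$$

with $\overline{xt} = \frac{1}{N}\sum_n x_n t_n$ and $\overline{x^2} = \frac{1}{N}\sum_n x_n^2$.

With these two values we have everything we need: a well-fitting line can be computed directly from the dataset.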
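The closed-form solution for a straight line, w1 = (mean(x·t) − mean(x)·mean(t)) / (mean(x²) − mean(x)²) and w0 = mean(t) − w1·mean(x), can be sketched as follows. The function name `fit_line` and the toy data are assumptions for illustration, not the author's original code.

```python
import numpy as np


def fit_line(x, t):
    """Return (w0, w1) minimizing the average squared loss of t = w0 + w1 * x."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    x_mean, t_mean = x.mean(), t.mean()
    # Slope: (mean(x*t) - mean(x)*mean(t)) / (mean(x^2) - mean(x)^2)
    w1 = (np.mean(x * t) - x_mean * t_mean) / (np.mean(x ** 2) - x_mean ** 2)
    # Intercept: mean(t) - w1 * mean(x)
    w0 = t_mean - w1 * x_mean
    return w0, w1


# Toy data lying exactly on t = 1 + 2x (made up for illustration).
x = [0.0, 1.0, 2.0, 3.0]
t = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_line(x, t)
print(w0, w1)
```

With raw calendar years as inputs, the denominator subtracts two large, nearly equal numbers, so subtracting the mean year from x before fitting may give better numerical accuracy.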

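5. Least squares fit of the Olympic data

The implementation has four parts: code that reads the data, a fit that computes the unknown parameters, a top-level script that runs them, and plotting.

The result is a fitted line for each of the men's and the women's results. (The datasets are in the program folder.)

Prediction

To predict, substitute a future year into the fitted equation and look at the result. You can try a value yourself.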
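Prediction with a fitted line is a single evaluation of t = w0 + w1·x. In the sketch below the function name `predict_time` is hypothetical, and the coefficients are placeholders chosen for easy arithmetic; they are not the values fitted to the Olympic data.

```python
def predict_time(w0, w1, year):
    """Evaluate the fitted line t = w0 + w1 * year."""
    return w0 + w1 * year


# Placeholder coefficients, NOT the fitted Olympic parameters.
w0_example, w1_example = 30.0, -0.01
predicted = predict_time(w0_example, w1_example, 2000)
print(predicted)
```

A linear model keeps decreasing forever, so predictions far outside the range of years in the data should be read with caution.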

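Introducing vectors and matrices

1. A first look

We already know that each data item is made up of several features and that each feature corresponds to one variable, so many features mean many variables. Matrices then become necessary. In fact, it helps to adopt the mindset that data = matrix. Let

$$\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}, \qquad \mathbf{x}_n = \begin{bmatrix} 1 \\ x_n \end{bmatrix}$$

Then

$$f(x_n; w_0, w_1) = \mathbf{w}^T\mathbf{x}_n = w_0 + w_1 x_n$$

The squared loss function from before can be written as

$$L = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\mathbf{x}_n\right)^2 \tag{1}$$

The dataset consists of the independent-variable data $\mathbf{X}$ and the dependent-variable data $\mathbf{t}$. Let

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}, \qquad \mathbf{t} = \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{bmatrix}$$

Then

$$L = \frac{1}{N}\left(\mathbf{t} - \mathbf{X}\mathbf{w}\right)^T\left(\mathbf{t} - \mathbf{X}\mathbf{w}\right) \tag{2}$$

(Proof: the $n$-th entry of $\mathbf{t} - \mathbf{X}\mathbf{w}$ is $t_n - \mathbf{w}^T\mathbf{x}_n$, so the inner product of this vector with itself is $\sum_n (t_n - \mathbf{w}^T\mathbf{x}_n)^2$.)

Expanding formula (2) further gives

$$L = \frac{1}{N}\left(\mathbf{t}^T\mathbf{t} - \mathbf{t}^T\mathbf{X}\mathbf{w} - \mathbf{w}^T\mathbf{X}^T\mathbf{t} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}\right) = \frac{1}{N}\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - \frac{2}{N}\mathbf{w}^T\mathbf{X}^T\mathbf{t} + \frac{1}{N}\mathbf{t}^T\mathbf{t}$$

The terms $\mathbf{t}^T\mathbf{X}\mathbf{w}$ and $\mathbf{w}^T\mathbf{X}^T\mathbf{t}$ are transposes of each other. Each evaluates to a scalar, and the transpose of a scalar is itself, so the two terms are equal and can be merged. (This formula is important and is used throughout what follows.)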
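A quick numerical check that the scalar loss, the matrix form (1/N)(t − Xw)ᵀ(t − Xw), and its expansion (1/N)(wᵀXᵀXw − 2wᵀXᵀt + tᵀt) agree. The data and the parameter vector below are arbitrary toy values chosen for illustration.

```python
import numpy as np

# Toy data and an arbitrary parameter vector [w0, w1] (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.2, 2.9, 5.1, 6.8])
w = np.array([0.5, 1.5])

# Design matrix: a column of ones (for w0) next to the inputs (for w1).
X = np.column_stack([np.ones_like(x), x])
N = len(t)

# Scalar form: average of squared residuals.
scalar_loss = np.mean((t - (w[0] + w[1] * x)) ** 2)

# Matrix form: (1/N) (t - Xw)^T (t - Xw).
r = t - X @ w
matrix_loss = (r @ r) / N

# Expanded form: (1/N) (w^T X^T X w - 2 w^T X^T t + t^T t).
expanded_loss = (w @ X.T @ X @ w - 2 * (w @ X.T @ t) + t @ t) / N

print(scalar_loss, matrix_loss, expanded_loss)
```

All three numbers coincide up to floating-point rounding, whatever values are chosen for `w`.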

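2. Proof that the vector/matrix loss is equivalent to the scalar loss

Take the case of two unknown parameters.

Supplement 1:

$$\mathbf{X}^T\mathbf{X} = \begin{bmatrix} \sum_n x_{n0}^2 & \sum_n x_{n0}x_{n1} \\ \sum_n x_{n1}x_{n0} & \sum_n x_{n1}^2 \end{bmatrix}$$

where

$$\mathbf{X} = \begin{bmatrix} x_{10} & x_{11} \\ x_{20} & x_{21} \\ \vdots & \vdots \\ x_{N0} & x_{N1} \end{bmatrix}$$

Proof: entry $(i, j)$ of $\mathbf{X}^T\mathbf{X}$ is the inner product of columns $i$ and $j$ of $\mathbf{X}$, that is, $\sum_n x_{ni}x_{nj}$.

Earlier we obtained the expanded vector form of the loss function:

$$L = \frac{1}{N}\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - \frac{2}{N}\mathbf{w}^T\mathbf{X}^T\mathbf{t} + \frac{1}{N}\mathbf{t}^T\mathbf{t}$$

The term $\frac{1}{N}\mathbf{t}^T\mathbf{t}$ does not depend on $\mathbf{w}$, so it plays no part in the differentiation below and we drop it. Expanding the remaining terms with the result of Supplement 1 gives

$$\frac{1}{N}\left(w_0^2\sum_n x_{n0}^2 + 2 w_0 w_1\sum_n x_{n0}x_{n1} + w_1^2\sum_n x_{n1}^2\right) - \frac{2}{N}\left(w_0\sum_n x_{n0}t_n + w_1\sum_n x_{n1}t_n\right)$$

(Note that the column subscripts of the matrix start from 0 here: $x_{n0}$ is the first element of row $n$ of $\mathbf{X}$, i.e. the first element of the $n$-th data object, and $x_{n1}$ is the second. The notation is changed so that these subscripts match those of $w_0$ and $w_1$, which makes them easier to remember.)

Substituting our earlier setting $x_{n0} = 1$, $x_{n1} = x_n$:

$$\frac{1}{N}\left(N w_0^2 + 2 w_0 w_1\sum_n x_n + w_1^2\sum_n x_n^2\right) - \frac{2}{N}\left(w_0\sum_n t_n + w_1\sum_n x_n t_n\right)$$

With $\bar{x} = \frac{1}{N}\sum_n x_n$, $\bar{t} = \frac{1}{N}\sum_n t_n$, $\overline{x^2} = \frac{1}{N}\sum_n x_n^2$ and $\overline{xt} = \frac{1}{N}\sum_n x_n t_n$, this simplifies to

$$w_0^2 + 2 w_0 w_1\bar{x} + w_1^2\,\overline{x^2} - 2 w_0\bar{t} - 2 w_1\,\overline{xt}$$

so that

$$\frac{\partial L}{\partial w_0} = 2 w_0 + 2 w_1\bar{x} - 2\bar{t}, \qquad \frac{\partial L}{\partial w_1} = 2 w_1\,\overline{x^2} + 2 w_0\bar{x} - 2\,\overline{xt}$$

which is equivalent to the result obtained from the scalar form.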

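3. Obtaining the unknown parameters from the vector/matrix loss

We showed above that the vector form and the scalar form give equivalent loss functions. In practice, data processing is done with matrices, and matrix computation is also more convenient on a computer, so we derive the unknown parameters directly from the matrix form.

First we introduce a few matrix calculus identities (see a text on matrix theory for their derivation):

$$\begin{array}{c|c} f(\mathbf{w}) & \partial f / \partial \mathbf{w} \\ \hline \mathbf{w}^T\mathbf{x} & \mathbf{x} \\ \mathbf{x}^T\mathbf{w} & \mathbf{x} \\ \mathbf{w}^T\mathbf{w} & 2\mathbf{w} \\ \mathbf{w}^T\mathbf{C}\mathbf{w} & 2\mathbf{C}\mathbf{w} \quad (\mathbf{C}\ \text{symmetric}) \end{array}$$

Differentiating the loss $L = \frac{1}{N}\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - \frac{2}{N}\mathbf{w}^T\mathbf{X}^T\mathbf{t} + \frac{1}{N}\mathbf{t}^T\mathbf{t}$ directly with these identities and setting the result to zero:

$$\frac{\partial L}{\partial \mathbf{w}} = \frac{2}{N}\mathbf{X}^T\mathbf{X}\mathbf{w} - \frac{2}{N}\mathbf{X}^T\mathbf{t} = 0 \quad\Longrightarrow\quad \mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{t}$$

$$\widehat{\mathbf{w}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{t}$$

This is the final formula for the parameter vector.

The formula is very important, and so is the form of each quantity in it (as defined earlier).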
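The matrix solution, which solves XᵀXw = Xᵀt for w, can be sketched with NumPy as follows. The function name `fit_least_squares` and the toy data are assumptions for illustration. The sketch solves the linear system rather than forming the inverse explicitly, which gives the same result when XᵀX is invertible.

```python
import numpy as np


def fit_least_squares(X, t):
    """Solve the normal equations X^T X w = X^T t for w."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ t)


# Toy data lying exactly on t = 1 + 2x (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 5.0, 7.0])

# Each row of X is one observation: [1, x_n].
X = np.column_stack([np.ones_like(x), x])
w_hat = fit_least_squares(X, t)

# Prediction for a new input x = 4, using the row vector [1, 4].
t_new = np.array([1.0, 4.0]) @ w_hat
print(w_hat, t_new)
```

For badly conditioned or rank-deficient X, `np.linalg.lstsq(X, t, rcond=None)` is a more robust alternative to the normal equations.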

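4. Summary

With the matrix form of the model, we can make predictions with any linear model:

$$t_{\mathrm{new}} = \widehat{\mathbf{w}}^T\mathbf{x}_{\mathrm{new}}$$

Each row of $\mathbf{X}$ is one independent observation, and each column within a row is one feature (variable).

In fact, the matrix $\mathbf{X}$ can be extended; that extension is left for a later note.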

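5. Re-implementing the Olympic data fit

5.1. Reading the data

5.2. Fitting the data

5.3. Top-level test script (the plotting is also done in this script)

Result: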

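Hands-on practice

All the code above already amounts to hands-on practice, but writing every algorithm yourself would be exhausting, so here we introduce some scikit-learn functions that do the same job.

scikit-learn has a module for linear models, linear_model. What we need is the linear regression class in that module.

Details: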

sklearn.linear_model.LinearRegression

Class for ordinary least squares linear regression (the descriptions below are taken from the scikit-learn user guide, Release 0.18.1).

Attributes:

coef_ : array of shape (n_features,) or (n_targets, n_features). The estimated coefficients of the linear regression problem, i.e. the w that often appears in the equations. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features); if only one target is passed, it is a 1D array of length n_features.

residues_ : array of shape (n_targets,) or (1,) or empty. Sum of residuals: the squared Euclidean 2-norm for each target passed during the fit. If the linear regression problem is under-determined (the number of linearly independent rows of the training matrix is less than its number of linearly independent columns), this is an empty array. If the target vector passed during the fit is 1-dimensional, this is a (1,) shape array. New in version 0.18.

intercept_ : the intercept, i.e. the constant term b of the line.

Methods:

__init__(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

fit(X, y, sample_weight=None)

Purpose: fit the linear model. Parameters: X : training data (independent variables), a numpy array of shape [n_samples, n_features]. y : target values (dependent variable), a numpy array of shape [n_samples, n_targets]. sample_weight : the weight of each sample, shape [n_samples].

get_params(deep=True)

Get parameters for this estimator. Parameters: deep : boolean, optional. If True, will return the parameters for this estimator and contained subobjects that are estimators. Returns: params : mapping of string to any. Parameter names mapped to their values.

predict(X)

Purpose: predict using the linear model. Parameters: X : the data to predict on, shape (n_samples, n_features). Returns: an array of shape (n_samples,).

score(X, y, sample_weight=None)

Returns the coefficient of determination R^2 of the prediction. The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0. Parameters: X : array-like, shape = (n_samples, n_features). Test samples. y : array-like, shape = (n_samples) or (n_samples, n_outputs). True values for X. sample_weight : array-like, shape = [n_samples], optional. Sample weights. Returns: score : float. R^2 of self.predict(X) wrt. y.

set_params(**params)

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object. Returns: self.
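To make the definition of R^2 concrete, the sketch below computes 1 - u/v by hand with NumPy, where u is the sum of squared residuals and v is the sum of squared deviations of y_true from its mean. The helper name `r2_score_manual` and the toy values are assumptions for illustration; this is not scikit-learn's own implementation.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - u / v."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
    v = np.sum((y_true - y_true.mean()) ** 2)  # sum of squares around the mean
    return 1.0 - u / v


# Toy targets (made up for illustration).
y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score_manual(y_true, y_true)                                # exact predictions
constant = r2_score_manual(y_true, np.full_like(y_true, y_true.mean()))  # always the mean
worse = r2_score_manual(y_true, y_true[::-1])                            # reversed predictions
print(perfect, constant, worse)
```

Exact predictions score 1.0, always predicting the mean scores 0.0, and predictions that are worse than the mean give a negative score.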

Here is new code for the same example as above:

# -*- coding: utf-8 -*-
from __future__ import print_function, division

from sklearn import linear_model
import matplotlib.pyplot as plt
import numpy as np

# Load the data: column 1 is read as the year, column 0 as the score (time).
# ndmin=2 keeps each column as an (n_samples, 1) array, the shape fit() expects.
year = np.loadtxt("data1.txt", delimiter=' ', usecols=(1,), ndmin=2)
score = np.loadtxt("data1.txt", delimiter=' ', usecols=(0,), ndmin=2)

'''
# test
print("year:\n", year)
print("the shape of year:", year.shape)
print("score:\n", score)
print("the shape of score:", score.shape)
'''

# X = np.ones(shape=(year.shape[0], 2))
# X[:, 1] = year[:, 0]
# print("X:\n", X)

# Linear regression: fit the model, then print the slope and the intercept
Lreg = linear_model.LinearRegression()
Lreg.fit(year, score)
print(Lreg.coef_)
print(Lreg.intercept_)

# Prediction on the training years
predict = Lreg.predict(year)

# Draw the data points and the fitted line
plt.plot(year, score, "o")
plt.plot(year, predict)
plt.show()

Result: