Fitting a line through data

Now we get to do some modeling. It's best to start simple, so we'll look at linear regression first. Linear regression is the first and probably the most fundamental model: a straight line through data.

Getting ready

The Boston dataset is well suited to experimenting with regression. It contains the median home price for several areas of Boston, along with other factors that might affect housing prices, such as the crime rate.

First, import the datasets module, then load the dataset:

from sklearn import datasets
# load_boston was removed in scikit-learn 1.2, so this call needs an older release
boston = datasets.load_boston()

How to do it...

Using linear regression in scikit-learn is quite simple. The API for linear regression is basically the same one you're now familiar with from the previous chapter.

First, import the LinearRegression class and create an instance:

from sklearn.linear_model import LinearRegression
lr = LinearRegression()

Now, it's as easy as passing the independent and dependent variables to the fit method of LinearRegression:

lr.fit(boston.data, boston.target)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now, to get the predictions, do the following:

predictions = lr.predict(boston.data)
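The fit and predict steps above can be sketched in a form that runs without the Boston data. This is a minimal sketch, not the recipe's own code: it assumes a synthetic dataset from make_regression, with 506 rows and 13 features chosen only to mirror the shape of the Boston data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (an assumption for illustration):
# 506 samples and 13 features, the same shape as the Boston dataset.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

lr = LinearRegression()
lr.fit(X, y)                   # estimates the intercept and one coefficient per feature
predictions = lr.predict(X)    # in-sample predictions, one per row of X

print(predictions.shape)
print(lr.coef_.shape)
```

The same two calls, fit and then predict, apply to any estimator that follows the scikit-learn API.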

It's then a good idea to look at how close the predictions are to the actual data. The differences between the actual values and the predictions are called the residuals, and a histogram is a convenient way to look at them.
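One way to inspect the residuals is sketched below. It assumes synthetic data rather than the Boston dataset, and it prints a rough text histogram built with np.histogram so that it needs no plotting library; with matplotlib installed, a call such as plt.hist(residuals, bins=20) would draw the same picture.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the Boston dataset).
X, y = make_regression(n_samples=500, n_features=5, noise=15.0, random_state=1)

lr = LinearRegression().fit(X, y)
residuals = y - lr.predict(X)   # actual minus predicted

# Bin the residuals and print one bar per bin.
counts, bin_edges = np.histogram(residuals, bins=20)
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:8.2f} to {right:8.2f} {'#' * (int(count) // 2)}")

# With an intercept in the model, the residuals average to zero
# up to floating-point error.
print("mean residual:", residuals.mean())
```

A histogram that is roughly symmetric and centered at zero is what the normal-error assumption would lead you to expect.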

Let's take a look at the coefficients:

lr.coef_
array([-1.08011358e-01,  4.64204584e-02,  2.05586264e-02,  2.68673382e+00,
       -1.77666112e+01,  3.80986521e+00,  6.92224640e-04, -1.47556685e+00,
        3.06049479e-01, -1.23345939e-02, -9.52747232e-01,  9.31168327e-03,
       -5.24758378e-01])

Going back to the data, we can see which factors have a negative relationship with the outcome and which have a positive one. For example, and as expected, an increase in the per capita crime rate by town has a negative relationship with the price of a home in Boston. The per capita crime rate is the first coefficient in the regression.
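Reading the signs is easier when each coefficient is paired with its feature name. The sketch below assumes synthetic data with hypothetical feature names (crime_rate, rooms, unrelated) and an assumed ground truth in which the target falls with the first feature and rises with the second.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical feature names, used only for illustration.
feature_names = ["crime_rate", "rooms", "unrelated"]
X = rng.normal(size=(400, 3))

# Assumed ground truth: negative effect, positive effect, no effect.
true_coef = np.array([-2.0, 3.0, 0.0])
y = 20.0 + X @ true_coef + rng.normal(scale=0.5, size=400)

lr = LinearRegression().fit(X, y)

# Pair each name with its estimated coefficient and report the sign.
for name, coef in zip(feature_names, lr.coef_):
    direction = "negative" if coef < 0 else "positive"
    print(f"{name:>12}: {coef:+.3f} ({direction})")
```

Keep in mind that the size of a coefficient depends on the units of its feature, so magnitudes are only comparable across features measured on similar scales.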

How it works...

The basic idea of linear regression is to find the set of coefficients β that satisfy y = Xβ, where X is the data matrix. It's unlikely that, for the given values of X, we will find a set of coefficients that exactly satisfies the equation; an error term is added to account for inexact specification or measurement error. The equation therefore becomes y = Xβ + ε, where ε is assumed to be normally distributed and independent of the X values. Geometrically, this means that the error terms are perpendicular to X. It's beyond the scope of this book, but it might be worth proving E(Xε) = 0 to yourself.
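The perpendicularity claim has a sample counterpart that is easy to check numerically: after a least-squares fit, the fitted residuals are orthogonal to every column of X. The sketch below assumes simulated data with made-up coefficients and uses np.linalg.lstsq; note that it checks the fitted residuals, not the unobservable errors ε.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Design matrix with an intercept column and two features (simulated).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -3.0])      # assumed coefficients
eps = rng.normal(scale=0.5, size=n)         # normally distributed errors
y = X @ beta_true + eps

# Least-squares estimate of beta and the resulting residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Each entry is the dot product of one column of X with the residuals;
# all of them are zero up to floating-point error.
print(X.T @ residuals)
```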

To find the set of betas that map the X values to y, we minimize the error term. This is done by minimizing the residual sum of squares. The problem can be solved analytically, with the solution being:

$$\beta = (X^{T} X)^{-1} X^{T} y$$
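The closed-form solution can be checked against LinearRegression directly. This is a minimal sketch on simulated data with assumed coefficients. It prepends a column of ones so that the intercept becomes part of β, and it solves the linear system (XᵀX)β = Xᵀy with np.linalg.solve instead of forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Simulated data with assumed coefficients and an intercept of 5.
X = rng.normal(size=(300, 4))
y = 5.0 + X @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.3, size=300)

# Prepend a column of ones so the intercept is estimated as beta[0].
X1 = np.column_stack([np.ones(len(X)), X])

# Normal equations: solve (X^T X) beta = X^T y.
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)

# The same fit through scikit-learn.
lr = LinearRegression().fit(X, y)

print(beta[0], lr.intercept_)
print(beta[1:], lr.coef_)
```

The two sets of numbers agree up to floating-point error, since both minimize the same residual sum of squares.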

There's more...

The LinearRegression object can automatically normalize (or scale) the inputs:

# The normalize argument was removed in scikit-learn 1.2, so this needs an older release
lr2 = LinearRegression(normalize=True)
lr2.fit(boston.data, boston.target)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)
predictions2 = lr2.predict(boston.data)
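In scikit-learn releases without the normalize argument, scaling is usually done as an explicit preprocessing step. The sketch below is one such approach, assuming simulated data and a pipeline of StandardScaler followed by LinearRegression. It is not an exact replacement: StandardScaler divides each centered feature by its standard deviation, whereas normalize=True divided by the L2 norm. For ordinary least squares, either scaling changes the coefficients but leaves the predictions unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Simulated features on very different scales (an assumption for illustration).
X = rng.normal(size=(250, 3)) * np.array([1.0, 100.0, 0.01])
y = 2.0 + X @ np.array([0.5, 0.02, 30.0]) + rng.normal(scale=0.1, size=250)

# Fit once on the raw features and once with scaling inside a pipeline.
plain = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Predictions agree; the coefficients differ because the feature units differ.
print(np.allclose(plain.predict(X), scaled.predict(X)))
print(plain.coef_)
print(scaled.named_steps["linearregression"].coef_)
```

Putting the scaler inside the pipeline means the same scaling learned during fit is applied automatically whenever predict is called.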

Original source: http://www.packtpub.com

Original author: Trent Hauck
