
Using stochastic gradient descent for regression

Author: 到不了的都叫做远方
Last modified: 2020-04-20 15:07:35

In this recipe, we'll get our first taste of stochastic gradient descent. We'll use it for regression here, but for the next recipe, we'll use it for classification.


Getting ready

Stochastic Gradient Descent (SGD) is often an unsung hero in machine learning. Underneath many algorithms, SGD is doing the work. It's popular due to its simplicity and speed, both of which are very good things to have when dealing with a lot of data. The other nice thing about SGD is that, while it sits at the computational core of many ML algorithms, it does so because it describes the process simply: at the end of the day, we apply some transformation to the data, and then we fit the data to the model with some loss function.

How to do it…

If SGD is good on large datasets, we should probably test it on a fairly large dataset:

from sklearn import datasets
X, y = datasets.make_regression(int(1e6))
# Just in case the 1e6 throws you off.
print("{:,}".format(int(1e6)))
1,000,000

It's probably worth gaining some intuition about the composition and size of the object. Thankfully, we're dealing with NumPy arrays, so we can just access nbytes. (The built-in Python way of getting an object's size doesn't work for NumPy arrays.) This output may be system dependent, so you may not get the same results:

print "{:,}".format(X.nbytes)
800,000,000

To get some human perspective, we can convert nbytes to megabytes. There are roughly 1 million bytes in an MB:


X.nbytes / 1e6
800.0

So, the number of bytes per data point is:

X.nbytes / (X.shape[0] * X.shape[1])
8.0

Well, that was tidy, if fairly tangential to what we're trying to accomplish; still, it's worth knowing how to get the size of the objects you're dealing with. So, now that we have the data, we can simply fit an SGDRegressor model:

>>> import numpy as np
>>> from sklearn import linear_model
>>> sgd = linear_model.SGDRegressor()
>>> train = np.random.choice([True, False], size=len(y), p=[.75, .25])
>>> sgd.fit(X[train], y[train])
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=1000,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
             warm_start=False)

So, we have another "beefy" object. The main thing to know right now is that our loss function is squared_loss, which is the same loss used in linear regression. Also worth noting is that shuffle will apply a random shuffle to the data; this is useful if you want to break a potentially spurious correlation. With fit_intercept, scikit-learn will automatically include a column of ones. If you'd like to see more output while fitting, set verbose to 1.
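As a quick illustration (not part of the original recipe), those parameters can also be set explicitly. Note that the name of the squared loss depends on your scikit-learn version: older releases (like the one whose repr is shown above) use 'squared_loss', while 1.0 and later use 'squared_error':

from sklearn import linear_model

# Hypothetical explicit configuration of the parameters discussed above.
# You would fit this the same way as before: sgd_explicit.fit(X[train], y[train]).
sgd_explicit = linear_model.SGDRegressor(
    loss='squared_error',   # least-squares loss, as in linear regression ('squared_loss' on older versions)
    penalty='l2',           # default regularization
    fit_intercept=True,     # scikit-learn adds the column of ones for you
    shuffle=True,           # reshuffle the training data each epoch
    verbose=1,              # print progress information while fitting
)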

We can then predict, as we previously have, using scikit-learn's consistent API:
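The original listing is not reproduced here; a minimal sketch of what it describes might look like the following (the variable name linear_preds and the use of matplotlib are assumptions, not from the recipe):

import matplotlib.pyplot as plt

# Predict on the hold-out rows and inspect the residuals.
linear_preds = sgd.predict(X[~train])
residuals = linear_preds - y[~train]

# A histogram of the residuals should be tightly centred on zero
# and look roughly normal if the fit is good.
plt.hist(residuals, bins=50)
plt.title("Residuals on the hold-out set")
plt.show()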

You can see we actually got a really good fit. There is barely any variation and the histogram has a nice normal look.


How it works…

Clearly, the fake dataset we used wasn't too bad, but you can imagine datasets of much larger magnitude. For example, if you worked on Wall Street, on any given day there might be two billion transactions on a given exchange in a market. Now, imagine that you have a week's or a year's worth of data. Running in-core algorithms does not work with such huge volumes of data.

The reason this is normally difficult is that to do standard gradient descent, we're required to calculate the gradient over the entire dataset at every step. The gradient has the standard definition from any third-semester calculus course. The gist of the algorithm is that at each step we calculate a new set of coefficients by updating the current set using the learning rate and the gradient of the objective (cost) function.

In pseudocode, this might look like the following (a runnable sketch follows the variable descriptions below):

while not_converged:
    w = w - learning_rate * gradient(cost(w))

The relevant variables are as follows:

1. w: This is the coefficient matrix.

2. learning_rate: This shows how big a step to take at each iteration. It can be important to tune this if you aren't getting good convergence.

3. gradient: This is the vector of first derivatives of the cost with respect to w.

4. cost: This is the squared error for regression. We'll see later that this cost function can be adapted to work with classification tasks. This flexibility is one thing that makes SGD so useful.
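To make the batch update concrete, here is a minimal NumPy sketch of the pseudocode above (illustrative only, not scikit-learn's implementation). For the mean squared-error cost, the gradient works out to 2/n * X^T(Xw - y):

import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.01, n_iter=1000, tol=1e-6):
    """Plain batch gradient descent for least-squares regression."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Gradient of the mean squared error cost: 2/n * X^T (Xw - y).
        gradient = 2.0 / len(y) * X.T @ (X @ w - y)
        step = learning_rate * gradient
        w = w - step
        if np.linalg.norm(step) < tol:  # crude convergence test
            break
    return w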

This would not be so bad, except for the fact that the gradient function is expensive to compute. As the vector of coefficients gets larger, calculating the gradient becomes very expensive: for each update step, we need to compute a contribution from every point in the data, and then update.

Stochastic gradient descent works slightly differently: instead of the batch gradient descent definition above, we update the parameters with each new data point. The data point is picked at random, hence the name stochastic gradient descent.
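A minimal sketch of the stochastic version (again illustrative, and without the learning-rate schedule and regularization that SGDRegressor adds) replaces the full-data gradient with the gradient at one randomly chosen point:

import numpy as np

def stochastic_gradient_descent(X, y, learning_rate=0.01, n_epochs=5, seed=0):
    """Toy SGD for least-squares regression: one randomly picked sample per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            # Gradient of the squared error at a single point: 2 * (x_i . w - y_i) * x_i.
            gradient = 2.0 * (X[i] @ w - y[i]) * X[i]
            w = w - learning_rate * gradient
    return w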

This article is a translation of external content. If there is any infringement, please contact cloudcommunity@tencent.com for removal.
