
Using stochastic gradient descent for regression

Modified 2020-04-20 15:07:35

In this recipe, we'll get our first taste of stochastic gradient descent. We'll use it for regression here, and in the next recipe we'll use it for classification.


Getting ready

Stochastic Gradient Descent (SGD) is often an unsung hero in machine learning. Underneath many algorithms, there is SGD doing the work. It's popular due to its simplicity and speed, both of which are very good things to have when dealing with a lot of data. The other nice thing about SGD is that, while it sits at the computational core of many ML algorithms, it also describes the fitting process simply. At the end of the day, we apply some transformation to the data, and then we fit our data to the model with some loss function.


How to do it…

If SGD is good on large datasets, we should probably test it on a fairly large dataset:

from sklearn import datasets
X, y = datasets.make_regression(int(1e6))
# Just in case the 1e6 throws you off.
print("{:,}".format(int(1e6)))

It's probably worth gaining some intuition about the composition and size of the object. Thankfully, we're dealing with NumPy arrays, so we can just access nbytes. Python's built-in way of measuring object size (sys.getsizeof) is not a reliable guide for NumPy arrays; for a view, for instance, it does not count the underlying data buffer. This output may be system dependent, so you may not get the same results:


print("{:,}".format(X.nbytes))

To get some human perspective, we can convert nbytes to megabytes. There are roughly 1 million bytes in an MB:


X.nbytes / 1e6

So, the number of bytes per value stored in the array is:

X.nbytes / (X.shape[0]*X.shape[1])
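As a quick check of the arithmetic above, here is a small sketch on a much smaller array. The sample count and the random_state are arbitrary choices for illustration, not values from the recipe, and the sketch assumes make_regression returns its default float64 values, which take 8 bytes each.

```python
from sklearn import datasets

# A much smaller dataset than the one in the recipe, so this runs instantly.
# n_samples and random_state are illustrative choices only.
X_small, y_small = datasets.make_regression(
    n_samples=1000, n_features=100, random_state=0
)

total_bytes = X_small.nbytes                      # bytes in the array's data buffer
bytes_per_value = X_small.nbytes / X_small.size   # same as nbytes / (rows * columns)

print("{:,} bytes in total".format(total_bytes))
print("{} bytes per value (itemsize is {})".format(bytes_per_value, X_small.itemsize))
```

The itemsize attribute reports the same per-value figure directly, so the division is only there to mirror the calculation in the text.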

Well, isn't that tidy? It's fairly tangential to what we're trying to accomplish; however, it's worth knowing how to get the size of the objects you're dealing with. So, now that we have the data, we can simply fit an SGDRegressor model:


>>> import numpy as np
>>> from sklearn import linear_model
>>> sgd = linear_model.SGDRegressor()
>>> train = np.random.choice([True, False], size=len(y), p=[.75, .25])
>>> sgd.fit(X[train], y[train])
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=1000,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,

So, we have another "beefy" object. The main thing to know now is that our loss function is squared_loss (renamed squared_error in scikit-learn 1.0 and later), which is the same loss that is minimized in ordinary linear regression. Also worth noting is that shuffle will randomly shuffle the training data after each pass, which is useful if you want to break a potentially spurious correlation caused by the order of the rows. With fit_intercept, scikit-learn fits an intercept term, which is equivalent to automatically including a column of ones. If you'd like to see more output during the fitting, set verbose to 1.
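To make those parameters a little more concrete, here is a minimal sketch on a small synthetic dataset. The dataset size, the bias of 25.0, and the names X_demo, y_demo, and sgd_demo are assumptions made for illustration; they are not part of the recipe.

```python
from sklearn import datasets, linear_model

# bias=25.0 gives the synthetic targets a known intercept
# (an arbitrary value chosen for this illustration).
X_demo, y_demo = datasets.make_regression(
    n_samples=5000, n_features=10, bias=25.0, random_state=0
)

sgd_demo = linear_model.SGDRegressor(
    fit_intercept=True,  # learn an intercept term alongside the coefficients
    shuffle=True,        # reshuffle the training rows after each epoch
    verbose=0,           # set to 1 to print progress for each epoch
    random_state=0,      # make the shuffling reproducible
)
sgd_demo.fit(X_demo, y_demo)

print("learned intercept:", sgd_demo.intercept_)
print("number of coefficients:", sgd_demo.coef_.shape[0])
```

If the fit went well, the learned intercept should land close to the bias used to generate the data.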


We can then predict, as we previously have, using scikit-learn's consistent API:
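The code for this step does not appear on this page, so what follows is only a sketch of what the prediction and residual check might look like, not the author's original listing. It uses a smaller dataset than the recipe (10,000 rows, an assumption made for speed) and summarizes the residuals with np.histogram rather than drawing a plot.

```python
import numpy as np
from sklearn import datasets, linear_model

# Smaller stand-in for the recipe's dataset; the size is chosen only for speed.
X, y = datasets.make_regression(n_samples=10000, n_features=100, random_state=0)

# Same style of random train/test mask as in the recipe, seeded for repeatability.
rng = np.random.RandomState(0)
train = rng.choice([True, False], size=len(y), p=[.75, .25])

sgd = linear_model.SGDRegressor(random_state=0)
sgd.fit(X[train], y[train])

# Predict on the held-out rows and look at the residuals.
preds = sgd.predict(X[~train])
residuals = preds - y[~train]

print("mean residual: {:.4f}".format(residuals.mean()))
print("residual std:  {:.4f}".format(residuals.std()))
print("R^2 on held-out data: {:.4f}".format(sgd.score(X[~train], y[~train])))

# Bin counts of the residuals; these could be passed to a plotting library.
counts, edges = np.histogram(residuals, bins=10)
print(counts)
```

Passing residuals to matplotlib's hist function would give a histogram like the one discussed next.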

You can see we actually got a really good fit. The residuals show barely any variation, and their histogram has a nice normal look.


How it works…

Clearly, the fake dataset we used wasn't too bad, but you can imagine datasets of much larger magnitude. For example, if you worked on Wall Street, on any given day there might be two billion transactions on any given exchange in a market. Now, imagine that you have a week's or a year's data. Running in-core algorithms, which need the whole dataset in memory, does not work with such huge volumes of data.
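One way to cope with data that does not fit in memory is to feed the model one chunk at a time. The following sketch uses SGDRegressor's partial_fit method for that; the iter_chunks helper and the chunk size are hypothetical, and here the chunks are simply sliced from an in-memory array to stand in for data that would really be streamed from disk or a database.

```python
from sklearn import datasets, linear_model

# Stand-in for a dataset too large to hold in memory (size chosen for speed).
X, y = datasets.make_regression(n_samples=20000, n_features=20, random_state=0)

def iter_chunks(X, y, chunk_size):
    """Hypothetical helper: yield consecutive (X, y) chunks of at most chunk_size rows."""
    for start in range(0, len(y), chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]

sgd = linear_model.SGDRegressor(random_state=0)

# Each call to partial_fit makes one pass over the chunk it is given
# and keeps the coefficients learned from earlier chunks.
for X_chunk, y_chunk in iter_chunks(X, y, chunk_size=1000):
    sgd.partial_fit(X_chunk, y_chunk)

print("R^2 after one pass over the data: {:.4f}".format(sgd.score(X, y)))
```

Only one chunk needs to be in memory at any time, which is what makes this approach workable when the full dataset is not.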


The reason this is normally difficult is that to do standard gradient descent, we're required to calculate the gradient at every step. The gradient has the standard definition from any third-semester calculus course. The gist of the algorithm is that at each step we calculate a new set of coefficients by moving the current ones against the gradient of the objective function, scaled by a learning rate.


In pseudocode, this might look like the following:

while not_converged:
    w = w - learning_rate * gradient(cost(w))

The relevant variables are as follows:

1. w: This is the coefficient matrix.

2. learning_rate: This shows how big a step to take at each iteration. This might be important to tune if you aren't getting good convergence.

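3. gradient: This is the vector of first partial derivatives of the cost with respect to the coefficients. (The matrix of second derivatives is the Hessian, which is not used here.)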

4. cost: This is the squared error for regression. We'll see later that this cost function can be adapted to work with classification tasks. This flexibility is one thing that makes SGD so useful.
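The pseudocode and variables above can be sketched as runnable NumPy code for batch gradient descent. This is a minimal illustration under stated assumptions: the data is synthetic and noise-free, true_w is a made-up set of coefficients, the cost is the mean squared error, and convergence is declared when the gradient norm becomes tiny. Unlike the pseudocode, the gradient function here takes w directly rather than the value of the cost.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 1000, 5
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([1.5, -2.0, 0.5, 3.0, 0.0])   # hypothetical coefficients
y = X @ true_w                                   # noise-free targets

def cost(w):
    """Mean squared error of the linear model X @ w."""
    return np.mean((X @ w - y) ** 2)

def gradient(w):
    """First derivatives of the mean squared error with respect to w."""
    return 2.0 / n_samples * X.T @ (X @ w - y)

w = np.zeros(n_features)
learning_rate = 0.1
for step in range(500):
    grad = gradient(w)                 # uses every row of X on every step
    if np.linalg.norm(grad) < 1e-8:    # treat a tiny gradient as convergence
        break
    w = w - learning_rate * grad

print("final cost:", cost(w))
print("coefficients:", np.round(w, 3))
```

Note that every call to gradient touches all n_samples rows, which is the per-step cost that becomes a problem as the data grows.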

This would not be so bad, except that the gradient function is expensive. As the vector of coefficients and the number of data points grow, calculating the gradient becomes very expensive. For each update step, we need to compute the gradient over every point in the data before we can update the weights.


Stochastic gradient descent works slightly differently; instead of computing the gradient over the whole dataset, as in the batch gradient descent described above, we update the parameters with each new data point. This data point is picked at random, hence the name stochastic gradient descent.
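Here is a minimal NumPy sketch of that idea, assuming the same kind of synthetic, noise-free data as a batch example would use. The coefficients in true_w, the constant learning rate, and the fixed number of steps are all illustrative assumptions; scikit-learn's SGDRegressor is more elaborate (for example, it decays the learning rate and shuffles by epoch).

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 1000, 5
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([1.5, -2.0, 0.5, 3.0, 0.0])   # hypothetical coefficients
y = X @ true_w                                   # noise-free targets

w = np.zeros(n_features)
learning_rate = 0.01
for step in range(20000):
    i = rng.randint(n_samples)        # pick one data point at random
    error = X[i] @ w - y[i]           # prediction error on that point only
    grad_i = 2.0 * error * X[i]       # gradient of the squared error for that point
    w = w - learning_rate * grad_i    # update immediately, without a full pass

print("coefficients:", np.round(w, 3))
```

Each update costs a single dot product instead of a pass over the whole dataset, at the price of noisier individual steps.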



In case of copyright infringement, please contact cloudcommunity@tencent.com to have the content removed.


