
Taking a more fundamental approach to regularization with LARS

到不了的都叫做远方
Modified 2020-04-21 18:06:22


To borrow from Gilbert Strang's evaluation of Gaussian elimination, LARS is an idea you probably would have considered eventually had it not already been discovered by Efron, Hastie, Johnstone, and Tibshirani in their work [1].


Getting ready

Least-angle regression (LARS) is a regression technique that is well suited to high-dimensional problems, that is, p >> n, where p denotes the columns or features and n is the number of samples.


How to do it...

First, import the necessary objects. The data we use will have 200 data points and 500 features. We'll also choose low noise and a small number of informative features:


from sklearn.datasets import make_regression
reg_data, reg_target = make_regression(n_samples=200, n_features=500, n_informative=10, noise=2)

Since we used 10 informative features, let's also specify that we want 10 nonzero coefficients in LARS. We will probably not know the exact number of informative features beforehand, but it's useful for learning purposes:


from sklearn.linear_model import Lars
lars = Lars(n_nonzero_coefs=10)
lars.fit(reg_data, reg_target)

We can then verify that LARS returns the correct number of nonzero coefficients:


import numpy as np
np.sum(lars.coef_ != 0)
10

The question then is why it is more useful to use a smaller number of features. To illustrate this, let's hold out half of the data and train two LARS models, one with 12 nonzero coefficients and another with no predetermined amount. We use 12 here because we might have an idea of the number of important features, but we might not be sure of the exact number:


train_n = 100
lars_12 = Lars(n_nonzero_coefs=12)
lars_12.fit(reg_data[:train_n], reg_target[:train_n])
lars_500 = Lars() # it's 500 by default
lars_500.fit(reg_data[:train_n], reg_target[:train_n])

Now, to see how well each model fits the unknown data, do the following:

np.mean(np.power(reg_target[train_n:] - lars_12.predict(reg_data[train_n:]), 2))
2980.1317852867755
np.mean(np.power(reg_target[train_n:] - lars_500.predict(reg_data[train_n:]), 2))
9.6198147535136237e+30

Look again if you missed it; the error on the test set was clearly very high. Herein lies the problem with high-dimensional datasets: given a large number of features, it's typically not too difficult to get a model that fits the training sample well, but overfitting becomes a huge problem.

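To see that the culprit really is overfitting rather than a poor model, we can compare training error against test error for both models. The sketch below re-creates the example with a fixed `random_state` so it runs standalone; the exact numbers will therefore differ from those shown above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lars

# Re-create the data with a fixed seed so the sketch is reproducible
reg_data, reg_target = make_regression(n_samples=200, n_features=500,
                                       n_informative=10, noise=2, random_state=0)
train_n = 100
lars_12 = Lars(n_nonzero_coefs=12).fit(reg_data[:train_n], reg_target[:train_n])
lars_500 = Lars().fit(reg_data[:train_n], reg_target[:train_n])

def mse(model, X, y):
    return np.mean((y - model.predict(X)) ** 2)

# Both models look fine on the training half...
print(mse(lars_12, reg_data[:train_n], reg_target[:train_n]))
print(mse(lars_500, reg_data[:train_n], reg_target[:train_n]))
# ...but only the constrained model holds up on the held-out half
print(mse(lars_12, reg_data[train_n:], reg_target[train_n:]))
print(mse(lars_500, reg_data[train_n:], reg_target[train_n:]))
```

The unconstrained model drives its training error close to zero by spending its 100 training samples on up to 500 coefficients, which is exactly why it collapses on the held-out half.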

How it works...

Geometrically, LARS starts with all coefficients at zero and moves along the predictor most correlated with the response, say x1. We move along x1 until the pull on x1 by y is the same as the pull on x2 by y. When this occurs, we move along the equiangular path that bisects the angle between x1 and x2.

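This equiangular property can be checked numerically. The following sketch (not from the original text; it uses scikit-learn's `lars_path` on a small synthetic problem) verifies that after two LARS steps, the active predictors exert an equal pull on the residual:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=50, n_features=5, n_informative=2,
                       noise=1, random_state=0)
# method="lar" computes the pure LARS path (no lasso modification)
alphas, active, coefs = lars_path(X, y, method="lar")

# Coefficients at the second knot of the path: two features are active
beta = coefs[:, 2]
residual = y - X @ beta
corr = np.abs(X.T @ residual)  # the "pull" of each predictor on the residual

# The first two features to enter share the same absolute correlation
print(active[:2], corr)
```

The equal values in `corr` for the active features are the defining invariant of LARS: it advances in the direction that keeps all active predictors equally correlated with the residual.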

There's more...

Much in the same way we used cross-validation to tune ridge regression, we can do the same with LARS:


from sklearn.linear_model import LarsCV
lcv = LarsCV()
lcv.fit(reg_data, reg_target)

Using cross-validation will help us determine the best number of nonzero coefficients to use. Here, it turns out to be as shown:

np.sum(lcv.coef_ != 0)
28
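`LarsCV` also exposes the regularization level it settled on through its `alpha_` attribute. A hedged sketch (the data is regenerated here with a fixed seed, so the chosen number of coefficients will differ from the 28 shown above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LarsCV

reg_data, reg_target = make_regression(n_samples=200, n_features=500,
                                       n_informative=10, noise=2, random_state=0)
lcv = LarsCV(cv=5).fit(reg_data, reg_target)

print(np.sum(lcv.coef_ != 0))  # number of nonzero coefficients chosen by CV
print(lcv.alpha_)              # the cross-validated regularization value
```

Because cross-validation only estimates the generalization error, the selected count typically lands near, but not exactly at, the true number of informative features.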

[1]: Efron, Bradley; Hastie, Trevor; Johnstone, Iain; and Tibshirani, Robert (2004). "Least Angle Regression". Annals of Statistics 32(2): pp. 407–499. doi:10.1214/009053604000000067. MR 2060166.

This article is a translation of a foreign-language original.
