Feature selection on L1 norms

到不了的都叫做远方

Modified on 2020-05-07 14:13:18

We're going to work with some ideas similar to those we saw in the recipe on Lasso regression. In that recipe, we looked at the number of features that had zero coefficients. Now we're going to take this a step further and use the sparsity associated with L1 norms to preprocess the features.
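
For reference, the L1 penalty enters the Lasso objective roughly as follows. This is the form documented for scikit-learn's `Lasso`; the symbols n (number of samples), w (coefficient vector), and α (penalty strength) are notation introduced here for illustration and do not come from the recipe.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1,
\qquad \lVert w \rVert_1 = \sum_{j} \lvert w_j \rvert
```

Because the penalty grows linearly in each |w_j|, a large enough α pushes some coefficients to exactly zero, and that sparsity is what the feature selection relies on.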

Getting ready

We'll use the diabetes dataset to fit a regression. First, we'll fit a basic LinearRegression model with ShuffleSplit cross-validation. After that, we'll use Lasso regression (LassoCV) to find the coefficients that are 0 under an L1 penalty. This will hopefully help us avoid overfitting, which means that the model is too specific to the data it was trained on. To put it another way, an overfit model does not generalize well to outside data.

We're going to perform the following steps:

1. Load the dataset.

2. Fit a basic linear regression model.

3. Use feature selection to remove uninformative features.

4. Refit the linear regression and check how well it fits compared with the fully featured model.

How to do it...

First, let's get the dataset:

import sklearn.datasets as ds
diabetes = ds.load_diabetes()

Let's create the LinearRegression object:

from sklearn import linear_model
lr = linear_model.LinearRegression()

Let's also import the metrics module for the mean_squared_error function and the model_selection module for the ShuffleSplit cross-validation scheme:

from sklearn import metrics
from sklearn import model_selection
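# ShuffleSplit's first argument is n_splits (the number of random splits), not the sample count
shuff = model_selection.ShuffleSplit(n_splits=10, test_size=0.1)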

Now, let's fit the model and keep track of the mean squared error for each iteration of ShuffleSplit:

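import numpy as np

mses = []
for train, test in shuff.split(diabetes.data, diabetes.target):
    # train and test are integer index arrays, so select the held-out rows with test
    train_X = diabetes.data[train]
    train_y = diabetes.target[train]
    test_X = diabetes.data[test]
    test_y = diabetes.target[test]
    lr.fit(train_X, train_y)
    mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
# the value below is from the source's run; it changes with the random splits
np.mean(mses)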
2866.381286447516
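
To make the indexing in the loop concrete, here is a small sketch of what `ShuffleSplit.split` yields. The toy array `X_demo` and the name `splitter` are made up for illustration; the point is only that `train` and `test` are disjoint arrays of integer row positions, so held-out rows are picked with `test` itself.

```python
import numpy as np
from sklearn import model_selection

# Toy data (hypothetical): 10 rows, so a 10% test split holds out exactly one row.
X_demo = np.arange(20).reshape(10, 2)
splitter = model_selection.ShuffleSplit(n_splits=3, test_size=0.1, random_state=0)

splits = list(splitter.split(X_demo))
for train, test in splits:
    # Both arrays hold integer row positions, and they never overlap.
    overlap = np.intersect1d(train, test).size
    print(train.dtype.kind, len(train), len(test), overlap)
```

Applying `~` to an integer index array gives the bitwise complement of each index rather than "the other rows", so it should not be used to derive the test set from `train`.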

Now that we have the regular fit, let's check it again after eliminating any features whose coefficient is zero. Let's fit the Lasso regression:

from sklearn import feature_selection
cv = linear_model.LassoCV()
cv.fit(diabetes.data, diabetes.target)
cv.coef_
array([  -0.        , -226.2375274 ,  526.85738059,  314.44026013,
       -196.92164002,    1.48742026, -151.78054083,  106.52846989,
        530.58541123,   64.50588257])

We'll remove the first feature. I'll use a NumPy array to represent the columns that are to be included in the model:

import numpy as np
columns = np.arange(diabetes.data.shape[1])[cv.coef_ != 0]
columns
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
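
As an alternative to building the column array by hand, scikit-learn's `SelectFromModel` can wrap the Lasso fit and produce the same kind of selection. This is a minimal sketch, not the recipe's own method; the names `selector`, `mask`, and `X_selected` are introduced here. It relies on the selector's default threshold, which for Lasso-type estimators is a small positive value, so the result should usually match `coef_ != 0` but is not guaranteed to in every case.

```python
import numpy as np
from sklearn import datasets, feature_selection, linear_model

X, y = datasets.load_diabetes(return_X_y=True)

# Fit LassoCV inside the selector; features whose coefficient magnitude falls
# below the selector's threshold are dropped.
selector = feature_selection.SelectFromModel(linear_model.LassoCV()).fit(X, y)

mask = selector.get_support()        # boolean mask over the original columns
columns = np.flatnonzero(mask)       # integer column indices of the kept features
X_selected = selector.transform(X)   # data restricted to the kept columns
print(columns, X_selected.shape)
```

The boolean mask from `get_support()` can be reused later to apply the same selection to new data.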

Okay, so now we'll fit the model with only the selected features (see columns in the following code block):

l1mses = []
for train, test in shuff.split(diabetes.data, diabetes.target):
    # keep only the selected columns; held-out rows come from the integer array test
    train_X = diabetes.data[train][:, columns]
    train_y = diabetes.target[train]
    test_X = diabetes.data[test][:, columns]
    test_y = diabetes.target[test]
    lr.fit(train_X, train_y)
    l1mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
# the values below are from the source's run; they change with the random splits
np.mean(l1mses)
2871.7973251720105
np.mean(l1mses) - np.mean(mses)
5.416038724494683

As we can see, even though we removed an uninformative feature, the model fits slightly worse in this run. This isn't always the case, and because each loop draws its own random splits, a difference this small can change from run to run. In the next section, we'll compare fits between models where there are many uninformative features.
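
One way to make this comparison less dependent on chance is to score both feature sets on identical splits. The sketch below assumes a fixed `random_state` and uses `cross_val_score` in place of the manual loop; the helper `mean_mse` and the split settings are choices made here for illustration, not part of the recipe.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection

X, y = datasets.load_diabetes(return_X_y=True)

# An integer random_state makes split() yield the same partitions on every call.
cv = model_selection.ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# Columns with a nonzero Lasso coefficient, as in the recipe.
columns = np.flatnonzero(linear_model.LassoCV().fit(X, y).coef_)

def mean_mse(features):
    """Mean test MSE of LinearRegression over the shared splits."""
    scores = model_selection.cross_val_score(
        linear_model.LinearRegression(), features, y,
        cv=cv, scoring="neg_mean_squared_error")
    return -scores.mean()

full_mse = mean_mse(X)
selected_mse = mean_mse(X[:, columns])
print(full_mse, selected_mse, selected_mse - full_mse)
```

With shared splits, the printed difference reflects the change in features rather than a change in which rows happened to be held out.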

How it works...

First, we're going to create a regression dataset with many uninformative features:

X, y = ds.make_regression(noise=5)
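
To see how many of the generated features actually carry signal, `make_regression` can also return the ground-truth coefficients. This is a small sketch; `coef=True` and `random_state=0` are additions made here for inspection and reproducibility, and the names `true_coef` and `informative` are illustrative.

```python
import numpy as np
from sklearn import datasets

# coef=True additionally returns the coefficients used to generate y.
X, y, true_coef = datasets.make_regression(noise=5, coef=True, random_state=0)

informative = np.flatnonzero(true_coef)  # indices of features that influence y
print(X.shape)              # default size: 100 samples, 100 features
print(informative.size)     # default n_informative is 10
```

With the defaults, only a tenth of the columns influence the target, which is why dropping the rest can help so much.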

Let's fit a normal regression:

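mses = []
# ShuffleSplit lives in model_selection; its first argument is the number of splits
shuff = model_selection.ShuffleSplit(n_splits=10, test_size=0.1)
for train, test in shuff.split(X, y):
    # train and test are integer index arrays, so select the held-out rows with test
    train_X = X[train]
    train_y = y[train]
    test_X = X[test]
    test_y = y[test]
    lr.fit(train_X, train_y)
    mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
# the value below is from the source's run; it changes with the random data and splits
np.mean(mses)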
195.60396533629236

Now, we can walk through the same process for Lasso regression:

cv.fit(X, y)
LassoCV(alphas=None, copy_X=True, cv='warn', eps=0.001, fit_intercept=True,
        max_iter=1000, n_alphas=100, n_jobs=None, normalize=False,
        positive=False, precompute='auto', random_state=None,
        selection='cyclic', tol=0.0001, verbose=False)

We'll create the columns again. This is a nice pattern that allows us to specify the features we want to include:

import numpy as np
columns = np.arange(X.shape[1])[cv.coef_ != 0]
columns[:5]
array([ 0,  9, 21, 24, 26])
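mses = []
shuff = model_selection.ShuffleSplit(n_splits=10, test_size=0.1)
for train, test in shuff.split(X, y):
    # keep only the selected columns; held-out rows come from the integer array test
    train_X = X[train][:, columns]
    train_y = y[train]
    test_X = X[test][:, columns]
    test_y = y[test]
    lr.fit(train_X, train_y)
    mses.append(metrics.mean_squared_error(test_y, lr.predict(test_X)))
# the value below is from the source's run; it changes with the random data and splits
np.mean(mses)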
9.59713146586824

As we can see, we get an extreme improvement in the fit of the model. This exemplifies that we need to be cognizant that not all the features need to be, or should be, thrown into the model.
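
In the walkthrough above, the Lasso that picks the columns is fitted on all of the data, including rows later used for testing. A possible refinement, sketched here under assumptions of my own (a `Pipeline` with `SelectFromModel`, fixed `random_state` values, and the helper `mean_mse`), is to run the selection inside each training split so the held-out rows never influence which features are kept.

```python
from sklearn import datasets, linear_model, model_selection
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline

X, y = datasets.make_regression(noise=5, random_state=0)
cv = model_selection.ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)

def mean_mse(model):
    """Mean test MSE of a model over the shared splits."""
    scores = model_selection.cross_val_score(
        model, X, y, cv=cv, scoring="neg_mean_squared_error")
    return -scores.mean()

# Baseline: ordinary least squares on every feature.
all_features = linear_model.LinearRegression()

# Selection and regression refitted together on each training split.
selected = make_pipeline(
    SelectFromModel(linear_model.LassoCV()),
    linear_model.LinearRegression())

full_mse = mean_mse(all_features)
selected_mse = mean_mse(selected)
print(full_mse, selected_mse)
```

The pipeline costs more to fit, since LassoCV reruns for every split, but the resulting error estimate is not flattered by selecting features with knowledge of the test rows.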

This article is a translation of a foreign-language source.