
Feature selection


This recipe along with the two following it will be centered around automatic feature selection. I like to think of this as the feature analogue of parameter tuning. In the same way that we cross-validate to find an appropriately general parameter, we can find an appropriately general subset of features. This will involve several different methods.


The simplest idea is univariate selection. The other methods involve working with a combination of features.


An added benefit to feature selection is that it can ease the burden on the data collection. Imagine that you have built a model on a very small subset of the data. If all goes well, you might want to scale up and use the model to predict on the entire dataset. If this is the case, you can ease the engineering effort of data collection at that scale.

Getting ready

With univariate feature selection, scoring functions will come to the forefront again. This time, they will define the comparable measure by which we can eliminate features. In this recipe, we'll fit a regression model with 10,000 features, but only 1,000 points. We'll walk through the various univariate feature selection methods:

from sklearn import datasets
X, y = datasets.make_regression(1000, 10000)

Now that we have the data, we will compare the features that are kept by the various methods. This is actually a very common situation when you're dealing with text analysis or some areas of bioinformatics.

How to do it...

First, we need to import the feature_selection module:


from sklearn import feature_selection
f, p = feature_selection.f_regression(X, y)

Here, f is the f score associated with each linear model fit with just one of the features. We can then compare these features and, based on this comparison, cull features. p is the p value associated with that f value.

In statistics, the p value is the probability of a value more extreme than the current value of the test statistic. Here, the f value is the test statistic:


f[:5]
array([3.05618247, 0.68486343, 3.48984581, 1.67065581, 0.0524388 ])
p[:5]
array([0.08073791, 0.40811487, 0.0620392 , 0.19646989, 0.81891956])
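
As a quick sanity check on that definition, here is a minimal sketch (my addition, not part of the original recipe) that recomputes the p values from the f scores with scipy's F distribution. It assumes the default centered test in f_regression, which has 1 numerator and n_samples - 2 denominator degrees of freedom; the helper names dfd and p_check are just for illustration:

import numpy as np
from scipy import stats

# The p value is the upper-tail area of the F distribution beyond each f score,
# assuming 1 numerator and n_samples - 2 denominator degrees of freedom.
dfd = X.shape[0] - 2
p_check = stats.f.sf(f, 1, dfd)
np.allclose(p_check, p)  # expected to be True under the assumption above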

As we can see, many of the p values are quite large. We would rather the p values be quite small. So, we can grab NumPy out of our toolbox and choose all the p values less than .05. These will be the features we'll use for the analysis:

import numpy as np
idx = np.arange(0, X.shape[1])
features_to_keep = idx[p < .05]
len(features_to_keep)
511

As you can see, we're actually keeping a relatively large number of features. Depending on the context of the model, we can tighten this p value. This will lessen the number of features kept. Another option is using the VarianceThreshold object. We've learned a bit about it, but it's important to understand that our ability to fit models is largely based on the variance created by features. If there is no variance, then our features cannot describe the variation in the dependent variable. A nice feature of this, as per the documentation, is that because it does not use the outcome variable, it can be used for unsupervised cases.
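
Before moving on to the variance-based approach, here is a minimal sketch of tightening the cutoff (the .01 value and the stricter_features name are arbitrary illustrative choices, not from the original recipe):

# Require p < .01 instead of p < .05; this keeps a stricter subset of features.
stricter_features = idx[p < .01]
len(stricter_features)  # necessarily no more than the 511 features kept at .05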

To use VarianceThreshold, we will need to set the threshold below which features are eliminated. In order to do that, we just take the median of the feature variances and supply that:

var_threshold = feature_selection.VarianceThreshold(np.median(np.var(X, axis=0)))
var_threshold.fit_transform(X).shape
(1000, 4924)

As we can see, we eliminated roughly half the features, which is more or less what we would expect, given that the threshold was set at the median feature variance.

How it works...

In general, all these methods work by fitting a basic model with a single feature. Depending on whether we have a classification problem or a regression problem, we can use the appropriate scoring function.

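To illustrate that point (this example is my addition, not part of the original recipe), the univariate machinery is the same for both problem types; only the scoring function changes. SelectKBest wraps either f_regression or f_classif, and the variable names below are just for illustration:

from sklearn import datasets, feature_selection

# Regression: score each feature with the F test from f_regression.
X_reg, y_reg = datasets.make_regression(1000, 20)
reg_selector = feature_selection.SelectKBest(feature_selection.f_regression, k=5)
reg_selector.fit_transform(X_reg, y_reg).shape  # (1000, 5)

# Classification: swap in the ANOVA F test from f_classif.
X_clf, y_clf = datasets.make_classification(1000, 20)
clf_selector = feature_selection.SelectKBest(feature_selection.f_classif, k=5)
clf_selector.fit_transform(X_clf, y_clf).shape  # (1000, 5)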

Let's look at a smaller problem and visualize how feature selection will eliminate certain features. We'll use the same scoring function from the first example, but just 20 features:


X, y = datasets.make_regression(10000, 20) 
f, p = feature_selection.f_regression(X, y)

Now, let's plot the p values of the features; we can see which features will be eliminated and which will be kept:

from matplotlib import pyplot as plt
fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(np.arange(20), p, color='k') 
ax.set_title("Feature p values")

The output will be as follows: a bar chart titled "Feature p values", with one bar per feature showing its p value.

As we can see, many of the features won't be kept, but several will be.

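As a closing sketch (my addition, not part of the original recipe), scikit-learn can apply a cutoff like this for us. SelectPercentile keeps the top-scoring fraction of features; the percentile value and the selector name here are arbitrary illustrative choices:

# Keep roughly the top quarter of the 20 features by their univariate F score.
selector = feature_selection.SelectPercentile(feature_selection.f_regression, percentile=25)
selector.fit_transform(X, y).shape  # about a quarter of the 20 columns, so roughly (10000, 5)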
