Working with categorical variables

到不了的都叫做远方

Last modified 2020-04-20 10:14:29

Published in the column: scikit-learn Cookbook translation

Categorical variables are a problem. On one hand they carry valuable information; on the other hand, they are usually text, either the actual text or integers that stand for the text, like an index in a lookup table. We clearly need to represent the text as numbers for the model's sake, but we can't just use the id field or encode the categories naively. The issue mirrors the one in the Creating binary features through thresholding recipe: if we feed the model integer ids, it will interpret them as continuous, ordered values, even though the categories have no inherent order or distance.
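To make the ordering problem concrete, here is a minimal sketch that compares pairwise distances under the two encodings. The three-level category (red, yellow, blue) and the helper pairwise_distances are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical three-level category (red, yellow, blue) encoded two ways.
int_codes = np.array([[0.0], [1.0], [2.0]])   # naive integer ids
one_hot = np.eye(3)                           # one indicator column per level


def pairwise_distances(rows):
    """Euclidean distance between every pair of rows."""
    diff = rows[:, None, :] - rows[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


int_dist = pairwise_distances(int_codes)
hot_dist = pairwise_distances(one_hot)

# Integer ids: red-blue looks twice as far apart as red-yellow.
print(int_dist[0, 1], int_dist[0, 2])
# One-hot: every pair of distinct levels is equally far apart.
print(hot_dist[0, 1], hot_dist[0, 2])
```

A distance-based or linear model would read the first encoding as "blue is further from red than yellow is", which is an artifact of the ids rather than a property of the data.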

Getting ready

The boston dataset won't be useful for this section. While it's useful for feature binarization, it won't suffice for creating features from categorical variables. For this, the iris dataset will suffice. For this to work, the problem needs to be turned on its head: imagine that the goal is to predict the sepal width, so the species of the flower will probably be useful as a feature.

Let's load the data first:

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

Now, with X and y as they normally are, we'll combine them so that we can operate on the data as one array:

import numpy as np
d = np.column_stack((X, y))
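As a quick sanity check on the combined array, the following sketch inspects its shape and its last column. It reloads the data so that it runs on its own; nothing here goes beyond the calls already used above.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
d = np.column_stack((iris.data, iris.target))

print(d.shape)               # 150 rows: 4 measurements + 1 species column
print(np.unique(d[:, -1]))   # the species codes, now stored as floats
print(iris.target_names)     # the text labels those codes stand for
```

Note that column_stack upcasts the integer species codes to floats so that they share a dtype with the measurements.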

How to do it...

Convert the species column to three features. The categories have no ordering, so each one gets its own indicator column:

from sklearn import preprocessing
text_encoder = preprocessing.OneHotEncoder(categories='auto')
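text_encoder.fit_transform(d[:, -1:]).toarray()[:5]  # turns categories into numbers without distorting distances
# For example, coding red, yellow, and blue as 1, 2, 3 would impose an order and unequal distances between the colors.
# Coding red as [1, 0, 0], yellow as [0, 1, 0], and blue as [0, 0, 1] removes that problem.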
array([[ 1., 0., 0.],
       [ 1., 0., 0.],
       [ 1., 0., 0.],
       [ 1., 0., 0.],
       [ 1., 0., 0.]])
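To connect the encoded columns back to the sepal-width problem described in Getting ready, here is one possible sketch. The choice of LinearRegression and of fit_intercept=False is an assumption made for illustration; the recipe itself does not fit a model.

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.linear_model import LinearRegression

iris = datasets.load_iris()
sepal_width = iris.data[:, 1]                 # column 1 is sepal width (cm)
other_measurements = iris.data[:, [0, 2, 3]]  # remaining three measurements

# One-hot encode the species column; toarray() makes the sparse result dense.
encoder = preprocessing.OneHotEncoder(categories='auto')
species = encoder.fit_transform(iris.target.reshape(-1, 1)).toarray()

features = np.hstack((other_measurements, species))

# fit_intercept=False because the three indicator columns already sum to 1,
# so together they act as a per-species intercept.
model = LinearRegression(fit_intercept=False).fit(features, sepal_width)
print(features.shape)
print(model.score(features, sepal_width))     # in-sample R^2 only
```

The score printed here is computed on the training data, so it says nothing about how well the model would generalize.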

How it works...

The encoder creates one additional feature for each level of the categorical variable and returns the result as a sparse matrix. The result is sparse by construction: each row of the new features has 0 everywhere except in the column associated with that row's category, so it makes sense to store the data in a sparse matrix. text_encoder is now a fitted scikit-learn transformer, which means that it can be used again:

text_encoder.transform(np.ones((3, 1))).toarray()
array([[ 0., 1., 0.],
       [ 0., 1., 0.],
       [ 0., 1., 0.]])
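A fitted encoder also exposes what it learned. The sketch below looks at categories_, maps indicator rows back with inverse_transform, and shows one way to deal with values that were never seen during fitting. The handle_unknown='ignore' setting is not part of the original recipe and is included here as an assumption about what a reader may need.

```python
import numpy as np
from sklearn import datasets, preprocessing

iris = datasets.load_iris()
codes = iris.target.reshape(-1, 1).astype(float)

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = preprocessing.OneHotEncoder(categories='auto', handle_unknown='ignore')
encoder.fit(codes)

print(encoder.categories_)   # learned levels, one array per input column

new_codes = np.array([[2.0], [0.0], [7.0]])   # 7.0 never appeared during fit
encoded = encoder.transform(new_codes).toarray()
print(encoded)

# inverse_transform maps indicator rows back to the original codes.
print(encoder.inverse_transform(encoded[:2]))
```

With the default handle_unknown='error', the transform call above would raise instead of producing an all-zero row for 7.0.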

There's more...

Other options exist for encoding categorical variables in scikit-learn and in Python at large. DictVectorizer is a good option if you'd like to limit your project's dependencies to scikit-learn alone and you have a fairly simple encoding scheme. However, if you require more sophisticated categorical encoding, patsy is a very good option.

DictVectorizer

Another option is to use DictVectorizer. This can be used to directly convert strings to features:

>>> from sklearn.feature_extraction import DictVectorizer
>>> dv = DictVectorizer()
>>> my_dict = [{'species': iris.target_names[i]} for i in y]
>>> dv.fit_transform(my_dict).toarray()[:5]
array([[ 1., 0., 0.],
       [ 1., 0., 0.],
       [ 1., 0., 0.],
       [ 1., 0., 0.],
       [ 1., 0., 0.]])

Dictionaries can be viewed as a sparse matrix: they only contain entries for the nonzero values.
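To see how the dictionary keys turn into columns, here is a small sketch on hand-written records. The records and the petal_length key are hypothetical; the point is only that string values become one indicator column per value, numeric values keep a single column, and a missing key is treated as zero.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records mixing a string-valued and a numeric feature.
records = [
    {'species': 'setosa', 'petal_length': 1.4},
    {'species': 'virginica', 'petal_length': 6.0},
    {'species': 'setosa'},                  # petal_length missing -> 0
]

dv = DictVectorizer()
encoded = dv.fit_transform(records)         # scipy sparse matrix by default

# String values become one 'name=value' indicator column each;
# numeric values keep a single column holding the number itself.
print(dv.feature_names_)
print(encoded.toarray())
```

The feature_names_ attribute is handy for checking which output column corresponds to which category.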

Patsy

patsy is another package useful for encoding categorical variables. Often used in conjunction with StatsModels, patsy can turn an array of strings into a design matrix.

This section does not directly pertain to scikit-learn; therefore, skipping it is okay without impacting the understanding of how scikit-learn works.

For example, dm = patsy.dmatrix("x + y") will create the appropriate columns if x or y are strings. If they aren't, wrapping a variable as C(x) inside the formula signifies that it is a categorical variable.

For example, iris.target can be interpreted as a continuous variable if we don't know better. Therefore, use the following command:

import patsy
patsy.dmatrix("0 + C(species)", {'species': iris.target})  # "0 +" drops the intercept so each species gets its own column
DesignMatrix with shape (150, 3)
C(species)[0]     C(species)[1]       C(species)[2]
           1                 0                   0
           1                 0                   0
           1                 0                   0
           1                 0                   0
[...]
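The effect of C() and of the leading "0 +" is easiest to see by comparing column names. The sketch below uses a small hypothetical array of integer codes in place of iris.target and assumes patsy is installed.

```python
import numpy as np
import patsy

data = {'species': np.array([0, 1, 2, 0])}   # hypothetical integer codes

# Without C(), integers are treated as one numeric column.
numeric = patsy.dmatrix("0 + species", data)

# With C() and no intercept, every level gets its own indicator column.
full = patsy.dmatrix("0 + C(species)", data)

# With the default intercept, the first level is absorbed into 'Intercept'
# and only the remaining levels get indicator columns.
reduced = patsy.dmatrix("C(species)", data)

for dm in (numeric, full, reduced):
    print(dm.design_info.column_names)
```

Which of the last two codings is appropriate depends on whether the downstream model fits its own intercept.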

This article is a translation of a foreign-language work.