Machine Learning | Week 3: Data Representation and Feature Engineering

机器视觉CV

Published 2019-07-15 19:31:16

Included in the column: 机器视觉CV

Main topics of this section:

The role of feature engineering in machine learning, and several common feature engineering techniques.

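1. The role of feature engineering in machine learning

Feature engineering mainly addresses the following kinds of problems:

Most features used in machine learning are numeric, but non-numeric features (also called discrete features) often carry important information too.
Scaling certain numeric features is also common, and important, in machine learning.
Some features in a dataset may not carry enough information on their own, so it can help to expand them (for example, by adding interaction terms (products) or polynomial terms).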
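The scaling point in the list above can be illustrated with a minimal sketch. It assumes scikit-learn's StandardScaler as the scaling method (the article does not name one), and the toy array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# StandardScaler shifts each column to zero mean and rescales it to unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns are on a comparable scale, which matters for models that are sensitive to feature magnitudes.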

In short, feature engineering is about how to preprocess and combine your data so that the model performs as well as it can.

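2. Common feature engineering techniques

2.1 Categorical variables

When the data contains non-numeric values, that is, discrete features, they have to be converted to numbers.

(1) One-hot encoding (dummy variables)

By far the most common way to represent categorical variables is one-hot encoding, also called one-out-of-N encoding or dummy variables. The idea behind dummy variables is to replace a categorical variable with one or more new features that take the values 0 and 1. The example here uses part of a dataset for predicting whether a person's income is above 50K or not. Among the columns used, only age and hours-per-week are numeric; the others are non-numeric, and encoding means turning these non-numeric values into numbers. There are two ways to convert data to a one-hot encoding: with pandas or with scikit-learn. pandas is a little simpler to use, so this article uses the pandas approach.

Reading the data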

import pandas as pd
# The file has no header row with column names, so we pass header=None
# and provide the column names explicitly in "names"
data = pd.read_csv("adult.data", header=None, names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                                                     'marital-status', 'occupation', 'relationship', 'race', 'gender',
                                                     'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
                                                     'income'])
# For illustration, keep only a few of the columns
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']]

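After reading the dataset, it is a good idea to check whether each column actually contains meaningful categorical data. A good way to inspect a column is the value_counts function of a pandas Series (Series is the data type of a single column of a DataFrame), which shows the unique values and how often each occurs:

print(data.gender.value_counts())

Here, for example, we check how many Male and Female values the gender feature contains. Output:
Male 21790
Female 10771
Name: gender, dtype: int64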

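pandas offers a very simple way to encode data: the get_dummies function. get_dummies automatically transforms all columns with object dtype (such as strings) and all categorical columns. The following compares the columns before and after encoding:

print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))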

Before encoding:
Original features:
['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']

After encoding:
Features after get_dummies:
['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupation_ Other-service', 'occupation_ Priv-house-serv', 'occupation_ Prof-specialty', 'occupation_ Protective-serv', 'occupation_ Sales', 'occupation_ Tech-support', 'occupation_ Transport-moving', 'income_ <=50K', 'income_ >50K']

The numeric columns age and hours-per-week are unchanged, while each categorical column has been replaced by one 0/1 column per category.

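Next we convert the data to NumPy arrays and train a machine learning model. Be sure to separate the target variable from the features (income used to be a single column, but after the dummy-variable step it has become two columns). Also note that label-based column slicing in pandas includes the end of the range, whereas NumPy slicing excludes it.

features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# Extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape: {} y.shape: {}".format(X.shape, y.shape))

Output:
X.shape: (32561, 44) y.shape: (32561,)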
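The slicing difference mentioned above is easy to check on a small example. This is a minimal sketch; the toy DataFrame and its column names are hypothetical and unrelated to the adult data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# Label-based slicing with .loc includes the end label "c".
loc_cols = list(toy.loc[:, "a":"c"].columns)

# Position-based slicing on the NumPy array excludes the end index 2.
arr = toy.values
np_slice = arr[:, 0:2]

print(loc_cols)        # ['a', 'b', 'c']
print(np_slice.shape)  # (2, 2)
```

This is why the slice in the article ends at the last occupation column: the two income columns that follow it are left out of the features.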

Predicting with a logistic regression model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))

Output:
Test score: 0.81

Calling get_dummies on a DataFrame that contains both the training and the test points also ensures that the training set and the test set end up with the same column names, so that the columns have the same meaning in both.
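The column-mismatch problem behind this advice can be seen on a toy example. The sketch below is only an illustration: the `color` column and its values are made up, and using `pd.concat` with `keys` is one possible way to encode both parts together and split them again.

```python
import pandas as pd

# Hypothetical train/test frames; "color" and its values are made up.
train = pd.DataFrame({"color": ["red", "green", "red"]})
test = pd.DataFrame({"color": ["green", "blue"]})

# Encoding the two frames separately yields different columns.
train_sep = pd.get_dummies(train)
test_sep = pd.get_dummies(test)
print(list(train_sep.columns))  # ['color_green', 'color_red']
print(list(test_sep.columns))   # ['color_blue', 'color_green']

# Encoding one combined frame and splitting it back keeps the columns identical.
combined = pd.concat([train, test], keys=["train", "test"])
encoded = pd.get_dummies(combined)
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]
print(list(train_enc.columns) == list(test_enc.columns))  # True
```

With separate encoding, the second column would mean "red" in the training set and "green" in the test set, which a model cannot detect on its own.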

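Summary: encoding non-numeric data is a very important part of machine learning. When encoding with one-hot, consider the following steps:

Read the data into a pandas DataFrame.
Take a first look at the data to see which features are numeric and which are not. For a non-numeric feature, use value_counts to see how many times each category occurs. The form is data.feature.value_counts(), where data is a DataFrame and feature is the name of a feature.
Encode the data with pd.get_dummies(data), where data is a DataFrame. Note: encode the training set and the test set together.
Choose a model and train it.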

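(2) Numbers can encode categorical variables

Categorical features are often encoded as integers. The fact that they are numbers does not mean they must be treated as continuous features. Whether an integer feature should be treated as continuous or as discrete (one-hot encoded) is sometimes unclear. If there is no ordering between the encoded meanings (as in the workclass example), the feature must be treated as discrete. In other cases (such as five-star ratings), which encoding is better depends on the task and the data, and on which machine learning algorithm is used.

The pandas get_dummies function treats all numbers as continuous and does not create dummy variables for them. To get around this, you can use scikit-learn's OneHotEncoder, which lets you specify which variables are continuous and which are discrete, or you can convert the numeric columns of the DataFrame to strings. Here is an example: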

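# Create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
print(demo_df)

# get_dummies() only encodes the non-numeric feature; the integer feature is left unchanged
pd.get_dummies(demo_df)

Output: the Integer Feature column is kept as it is, and Categorical Feature is replaced by the columns Categorical Feature_box, Categorical Feature_fox and Categorical Feature_socks.

# Convert the numeric feature to strings
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature'])

Output: both features are now encoded, giving the columns Integer Feature_0, Integer Feature_1, Integer Feature_2, Categorical Feature_box, Categorical Feature_fox and Categorical Feature_socks.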
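The scikit-learn route mentioned above can be sketched on the same toy data. This is an illustration under assumptions, not the article's own method: it uses ColumnTransformer to choose the columns to encode, and `sparse_threshold=0` to get a dense array back.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})

# Treat both columns as categorical, including the integer one.
# sparse_threshold=0 makes ColumnTransformer return a dense array.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ['Integer Feature', 'Categorical Feature'])],
    sparse_threshold=0,
)
encoded = ct.fit_transform(demo_df)

# 3 integer levels + 3 string levels give 6 columns.
print(encoded.shape)  # (4, 6)
```

Unlike the string-conversion trick, the integer column keeps its original dtype in the DataFrame; only the transformer treats it as categorical.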

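2.2 Interaction features and polynomial features

Another way to enrich a feature representation, especially for linear models, is to add interaction features and polynomial features of the original data.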
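As a minimal sketch of what such features look like, the example below assumes scikit-learn's PolynomialFeatures (the article does not name a tool) and a made-up two-feature array.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data with two features, x0 and x1.
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# degree=2 adds the squares and the pairwise product of the features.
# include_bias=False leaves out the constant column of ones.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# The columns are x0, x1, x0^2, x0*x1, x1^2.
print(X_poly.shape)  # (2, 5)
print(X_poly[0])     # values 2, 3, 4, 6, 9
```

The x0*x1 column is the interaction term; the squared columns are the polynomial terms.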

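2.3 Automatic feature selection

When adding new features, or when working with high-dimensional datasets in general, it is best to reduce the features to the most useful ones and drop the rest. This gives simpler models that generalize better. The following (supervised) methods judge how much each feature contributes:

Univariate statistics

Check whether there is a statistically significant relationship between each individual feature and the target, then select the features with the highest confidence. The methods differ in how they compute the threshold. The simplest are SelectKBest, which selects a fixed number k of features, and SelectPercentile, which selects a fixed percentage of features. The example below uses the breast cancer dataset that ships with sklearn.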

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()

# Use a fixed seed so the random numbers are reproducible
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
# Add noise features to the data
# The first 30 features come from the dataset, the next 50 are noise
X_w_noise = np.hstack([cancer.data, noise])

X_train, X_test, y_train, y_test = train_test_split(X_w_noise, cancer.target, random_state=0, test_size=.5)
# Use f_classif (the default) and SelectPercentile to select 50% of the features
select = SelectPercentile(percentile=50)
select.fit(X_train, y_train)
# Transform the training set
X_train_selected = select.transform(X_train)

print("X_train.shape: {}".format(X_train.shape))
print("X_train_selected.shape: {}".format(X_train_selected.shape))

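Output:
X_train.shape: (284, 80)
X_train_selected.shape: (284, 40)

The number of features drops from 80 to 40 (50% of the original number). The get_support method shows which features were selected; it returns a Boolean mask over the features: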

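import matplotlib.pyplot as plt

mask = select.get_support()
print(mask)
# Visualize the mask: black is True, white is False
plt.matshow(mask.reshape(1, -1), cmap='gray_r')
# Each position in the mask corresponds to one feature
plt.xlabel("Feature index")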
Output: [ True True True True True True True True True False True False True True True True True True False False True True True True True True True True True True False False False True False True False False True False False False False True False False True False False True False True False False False False False False True False True False False False False True False True False False False False True True False True False False False False]

As the visualization of the mask shows, most of the selected features are original features, and most of the noise features have been removed.
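SelectKBest, the other selector named above, works the same way but takes a count instead of a percentage. The sketch below rebuilds the noisy data so that it runs on its own; the choice of `k=10` is arbitrary and only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
# First 30 columns are real features, the next 50 are noise.
X_w_noise = np.hstack([cancer.data, noise])
X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state=0, test_size=.5)

# Keep the 10 features with the highest f_classif scores (k=10 is an arbitrary choice).
select_k = SelectKBest(score_func=f_classif, k=10)
X_train_k = select_k.fit_transform(X_train, y_train)
print(X_train_k.shape)  # (284, 10)

# Count how many selected features are original and how many are noise.
mask_k = select_k.get_support()
print("original:", mask_k[:30].sum(), "noise:", mask_k[30:].sum())
```

As in the article's example, the selector is fitted on the training split only and would then be applied to the test split with `transform`.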

Comparing the performance of Logistic regression on all features with its performance on the selected features only

from sklearn.linear_model import LogisticRegression

# Transform the test data
X_test_selected = select.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train, y_train)
print("Score with all features: {:.3f}".format(lr.score(X_test, y_test)))

lr.fit(X_train_selected, y_train)
print("Score with only selected features: {:.3f}".format(lr.score(X_test_selected, y_test)))

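Output:
Score with all features: 0.930
Score with only selected features: 0.940

Removing the noise features improved performance here, even though some of the original features were lost. This is a very simple, artificial example, and results on real data are more mixed. Still, univariate feature selection is very useful if there are so many features that a model cannot be built at all, or if you suspect that many features carry no information.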

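Model-based selection

Description: a supervised learning model is chosen and used to judge the importance of each feature, and only the most important features are kept. Model-based feature selection uses the SelectFromModel transformer: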

# The data is the same cancer data as above
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
# The target is a class label, so use the classifier that is imported above
select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42), threshold="median")
select.fit(X_train, y_train)
X_train_l1 = select.transform(X_train)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_l1.shape: {}".format(X_train_l1.shape))

Output (half of the features are selected):
X_train.shape: (284, 80)
X_train_l1.shape: (284, 40)

# Look at the selected features
mask = select.get_support()
print(mask)
# Visualize the mask: black is True, white is False
plt.matshow(mask.reshape(1, -1), cmap='gray_r')
# Each position in the mask corresponds to one feature
plt.xlabel("Feature index")

[ True True True True False False True True False False True False False True False True True True True True True True True True True True True True True False False True False False True False False False False False True True False False True False False False False False True False False True True False False True True False False True False False False True True True False True True True False False False False False True False False]


X_test_l1 = select.transform(X_test)
score = LogisticRegression().fit(X_train_l1, y_train).score(X_test_l1, y_test)
print("Test score: {:.3f}".format(score))

Output:
Test score: 0.944

With the better feature selection, performance improves as well.
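To make the role of `threshold="median"` more concrete, the sketch below compares the selector's mask with a mask built by hand from the forest's feature importances. For brevity it uses the plain cancer data without the added noise columns, which is a simplification of the article's setup.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

cancer = load_breast_cancer()
select = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median")
select.fit(cancer.data, cancer.target)

# The fitted forest exposes one importance value per feature.
importances = select.estimator_.feature_importances_
mask = select.get_support()

# A feature is kept when its importance is at least the median importance.
manual_mask = importances >= np.median(importances)
print(np.array_equal(mask, manual_mask))  # True
print(mask.sum(), "of", mask.size, "features kept")
```

Because the median splits the importances in half, this threshold keeps roughly half of the features, which matches the 80 to 40 reduction reported above.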

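Iterative selection

Description: choose a model and the number of features to keep. Starting from all the original features, the model is fitted repeatedly, and on each run one feature, the least important one according to the model, is removed, until only the required number of features remains.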

from sklearn.feature_selection import RFE
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=40)

select.fit(X_train, y_train)
# Visualize the selected features
mask = select.get_support()
plt.matshow(mask.reshape(1, -1), cmap='gray_r')
# Each position in the mask corresponds to one feature
plt.xlabel("Feature index")

Running this code also takes much longer than model-based selection, because a random forest model is trained 40 times, with one feature removed per run.

X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)

score = LogisticRegression().fit(X_train_rfe, y_train).score(X_test_rfe, y_test)
print("LogisticRegression Test score: {:.3f}".format(score))
print("RFE Test score: {:.3f}".format(select.score(X_test, y_test)))

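Output:
LogisticRegression Test score: 0.951
RFE Test score: 0.951

The random forest used inside RFE performs the same as a Logistic regression model trained on the selected features. In other words, once we have selected the right features, the linear model performs as well as the random forest.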
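The elimination order and the cost of RFE can be inspected through its `ranking_` attribute and `step` parameter. The sketch below is only an illustration: it uses the plain cancer data, a smaller forest, and `step=5`, none of which come from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

cancer = load_breast_cancer()

# step=5 removes five features per round instead of one (an illustrative choice),
# so fewer forests have to be trained to go from 30 features down to 10.
select = RFE(RandomForestClassifier(n_estimators=20, random_state=42),
             n_features_to_select=10, step=5)
select.fit(cancer.data, cancer.target)

# support_ marks the kept features; their rank is 1.
# Larger ranks belong to features that were eliminated in earlier rounds.
print(select.support_.sum())  # 10
print(select.ranking_)
```

A larger `step` trades some precision in the elimination order for shorter running time, which can help when one model fit per removed feature is too slow.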