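This post is based on Section 1 of Chapter 4, "Representing Data and Engineering Features", of the reference book 《Python机器学习基础教程》 (the Chinese edition of Introduction to Machine Learning with Python).
My own, rather basic understanding: mathematical models are built on mathematical expressions, and those expressions only accept numbers (continuous variables), not strings (categorical variables). Working out how to represent the collected data so that a model can use it, including turning strings into numbers, goes by the somewhat lofty name of feature engineering. Take the example data used in this section: incomes of adults in the United States in 1994, where the task is to predict whether a worker earns more than 50,000 US dollars or less. The variables used here are age, workclass, education, gender, hours-per-week, occupation, and income.
Of these, age and hours-per-week are continuous features, while workclass (type of employment), education, gender, and occupation are categorical variables. One way to handle categorical variables is one-hot encoding (also called one-out-of-N encoding, or dummy variables). The idea behind dummy variables is to replace a categorical variable with one or more new features that take the values 0 and 1, since those two values are meaningful in a mathematical formula. For example, take the dataset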
seq | gender | income | hours-per-week |
---|---|---|---|
1 | Male | 50,000 | 50 |
2 | Female | 60,000 | 40 |
After the conversion it takes a different form:
seq | Male | Female | income | hours-per-week |
---|---|---|---|---|
1 | 1 | 0 | 50,000 | 50 |
2 | 0 | 1 | 60,000 | 40 |
One way to perform this conversion in Python is the get_dummies() function in pandas.
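As a minimal sketch of the conversion shown in the two tables above, the toy frame below is rebuilt by hand (the name `toy` and its values are only for illustration). The `columns` and `dtype` arguments are real get_dummies() parameters; `dtype=int` is used here because recent pandas versions return True/False by default.

```python
import pandas as pd

# Toy data mirroring the two-row table above (illustrative values only)
toy = pd.DataFrame({
    "seq": [1, 2],
    "gender": ["Male", "Female"],
    "income": [50000, 60000],
    "hours-per-week": [50, 40],
})

# Encode only the 'gender' column; dtype=int yields 0/1 instead of True/False
toy_dummies = pd.get_dummies(toy, columns=["gender"], dtype=int)
print(toy_dummies)
```

Note that pandas names the new columns gender_Female and gender_Male and places them after the untouched columns, so the layout differs slightly from the hand-written table.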
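Next, I reproduce the example from the book.
Search for the keyword adult.data with a search engine to find the download address http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data. You can either copy the content into a text file or fetch it with Python, which makes for a very simple Python web-scraping example.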
from urllib.request import urlopen

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# Download the raw file and decode the bytes to text
with urlopen(url) as response:
    adult_data = response.read().decode('utf-8')
# Save a local copy; the with block closes the file automatically
with open('adult.data', 'w', encoding='utf-8') as fw:
    fw.write(adult_data)
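import pandas as pd
# No header row in the file, so column names are given explicitly; fields are
# separated by ", ", so skipinitialspace=True strips the leading space.
df = pd.read_csv('adult.data', header=None, index_col=False, skipinitialspace=True,
                 names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                        'marital-status', 'occupation', 'relationship', 'race', 'gender',
                        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
df.head()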
   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13
1   50  Self-emp-not-inc   83311  Bachelors             13
2   38           Private  215646    HS-grad              9
3   53           Private  234721       11th              7
4   28           Private  338409  Bachelors             13

       marital-status         occupation   relationship   race  gender  \
0       Never-married       Adm-clerical  Not-in-family  White    Male
1  Married-civ-spouse    Exec-managerial        Husband  White    Male
2            Divorced  Handlers-cleaners  Not-in-family  White    Male
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female

   capital-gain  capital-loss  hours-per-week native-country income
0          2174             0              40  United-States  <=50K
1             0             0              13  United-States  <=50K
2             0             0              40  United-States  <=50K
3             0             0              40  United-States  <=50K
4             0             0              40           Cuba  <=50K
Select the variables of interest
df = df[['age','workclass','education','gender','hours-per-week','occupation','income']]
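To check whether the categorical data contain anything unusual, use the value_counts() function, which shows each unique value and how often it occurs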
for col in df.columns:
    # Only string (object) columns are categorical here
    if df[col].dtype == "object":
        print(df[col].value_counts())
Output:
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
? 1836
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
HS-grad 10501
Some-college 7291
Bachelors 5355
Masters 1723
Assoc-voc 1382
11th 1175
Assoc-acdm 1067
10th 933
7th-8th 646
Prof-school 576
9th 514
12th 433
Doctorate 413
5th-6th 333
1st-4th 168
Preschool 51
Name: education, dtype: int64
Male 21790
Female 10771
Name: gender, dtype: int64
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
? 1843
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64
<=50K 24720
>50K 7841
Name: income, dtype: int64
The results show that the workclass and occupation variables contain "?" entries. Next, delete the rows that contain a question mark
df = df[df['occupation'] != "?"]
df = df[df['workclass'] != "?"]
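An alternative way to express the same cleanup is to treat "?" as a missing value and drop the affected rows. This is only a sketch on a hypothetical mini-sample (the frame `sample` and its values are made up for illustration), not the approach the book takes.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample; the real data has many more rows and columns
sample = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "occupation": ["Sales", "?", "?", "Tech-support"],
    "age": [39, 50, 38, 53],
})

# Mark "?" as missing, then drop rows with a missing value in either column
cleaned = (
    sample.replace("?", np.nan)
          .dropna(subset=["workclass", "occupation"])
)
print(cleaned)
```

Only the rows where both columns hold a real category survive, which matches what the two boolean filters above do.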
Use the get_dummies() function to convert the categorical variables
df_dummies = pd.get_dummies(df)
print("Features after get_dummies: \n", list(df_dummies.columns))
Output:
Features after get_dummies:
['age', 'hours-per-week', 'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private', 'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc', 'workclass_State-gov', 'workclass_Without-pay', 'education_10th', 'education_11th', 'education_12th', 'education_1st-4th', 'education_5th-6th', 'education_7th-8th', 'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc', 'education_Bachelors', 'education_Doctorate', 'education_HS-grad', 'education_Masters', 'education_Preschool', 'education_Prof-school', 'education_Some-college', 'gender_Female', 'gender_Male', 'occupation_Adm-clerical', 'occupation_Armed-Forces', 'occupation_Craft-repair', 'occupation_Exec-managerial', 'occupation_Farming-fishing', 'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct', 'occupation_Other-service', 'occupation_Priv-house-serv', 'occupation_Prof-specialty', 'occupation_Protective-serv', 'occupation_Sales', 'occupation_Tech-support', 'occupation_Transport-moving', 'income_<=50K', 'income_>50K']
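The column list above follows a pattern: numeric columns pass through unchanged, and each string column expands into columns named <column>_<value>. The sketch below shows this on a hypothetical three-row frame (names and values are illustrative only) and checks the one-hot property that exactly one dummy per original variable is 1 in each row.

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and two string columns
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "gender": ["Male", "Male", "Female"],
})
toy_dummies = pd.get_dummies(toy, dtype=int)

# Numeric columns are kept as-is; string columns become <column>_<value>
print(list(toy_dummies.columns))

# One-hot property: within each original variable, one dummy is 1 per row
for prefix in ["workclass", "gender"]:
    group = toy_dummies.filter(regex=f"^{prefix}_")
    assert (group.sum(axis=1) == 1).all()
```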
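# Label-based slice: keep every column from 'age' through 'occupation_Transport-moving'
# (both ends included), which leaves out the two income_* target columns
features = df_dummies.loc[:, 'age':'occupation_Transport-moving']
X = features.values
Y = df_dummies['income_>50K'].values
print("X.shape: {} Y.shape:{}".format(X.shape,Y.shape))
# Output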
X.shape: (30718, 41) Y.shape:(30718,)
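The feature matrix above comes from slicing columns by label. As a small illustration of how label-based and position-based slicing differ, the sketch below uses a hypothetical miniature frame (`demo` and its values are made up): with .loc both endpoints are included, while with .iloc the end position is excluded.

```python
import pandas as pd

# Hypothetical frame with the same kind of column layout in miniature
demo = pd.DataFrame({
    "age": [39, 50],
    "hours-per-week": [40, 13],
    "gender_Female": [0, 0],
    "gender_Male": [1, 1],
    "income_<=50K": [1, 1],
    "income_>50K": [0, 0],
})

# Label-based: both endpoints are included
by_label = demo.loc[:, "age":"gender_Male"]

# Position-based: the end position is excluded, as in ordinary Python slicing
by_position = demo.iloc[:, 0:4]

print(by_label.columns.equals(by_position.columns))
```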
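Writing this slice with .ix instead of .loc produces the following warning in older pandas versions (.ix was removed entirely in pandas 1.0, so .loc is the form to use):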
C:\Users\mingy\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
"""Entry point for launching an IPython kernel.
Train the model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
# Output
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
print("Test score:{:.2f}".format(logreg.score(X_test,y_test)))
# Output
Test score:0.81