Handling Categorical Variables in Modeling (Notes, Part 1)

User 7010445

Published 2020-03-03 14:25:09

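This post is based on the first section of Chapter 4, "Representing Data and Engineering Features", of the reference book Introduction to Machine Learning with Python (《Python机器学习基础教程》).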

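My own rough understanding: mathematical modeling is built on mathematical expressions, and those expressions only accept numbers (continuous variables), not strings (categorical variables). Converting the strings in the data we collect into numbers is part of what researchers call, somewhat grandly, feature engineering. The sample data used in this section is the 1994 US adult income dataset, whose task is to predict whether a worker's income is above or below 50,000 US dollars. The variables in the dataset include: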

age
workclass
education
gender
hours-per-week
occupation
income

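Here age and hours-per-week (hours worked per week) are continuous features, while workclass (type of employment), education (education level), gender and occupation are categorical variables. One way to handle them is one-hot encoding (also called one-out-of-N encoding, or dummy variables). The idea behind dummy variables is to replace one categorical variable with one or more new features that take the values 0 and 1, which are values a mathematical formula can work with. Take this dataset, for example: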

seq	gender	income	hours-per-week
1	Male	50,000	50
2	Female	60,000	40

After conversion it takes a different form:

seq	Male	Female	income	hours-per-week
1	1	0	50,000	50
2	0	1	60,000	40

In Python, one way to do this conversion is the get_dummies() function in pandas.
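As a minimal sketch of the conversion just described, the two-row table above can be rebuilt by hand and passed to get_dummies(). The DataFrame name `toy` and the `dtype=int` argument are choices made for this illustration (recent pandas versions otherwise return True/False instead of 1/0). Note that pandas names the new columns gender_Female and gender_Male rather than Female and Male.

```python
import pandas as pd

# Toy data copied from the two-row example table above.
toy = pd.DataFrame({
    "gender": ["Male", "Female"],
    "income": [50000, 60000],
    "hours-per-week": [50, 40],
})

# Only the listed string column is expanded; numeric columns pass through unchanged.
# dtype=int forces 0/1 output instead of booleans.
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)
print(encoded)
```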

Next, we reproduce the example from the book.

Step 1: Download the dataset

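Searching for the keyword adult.data in a search engine turns up the download address http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data. You can copy the contents into a text file, or fetch the file with Python, which makes for a very simple web-scraping exercise.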

Python code for fetching the file

from urllib.request import urlopen

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# Download the raw bytes and decode them as text
with urlopen(url) as response:
    adult_data = response.read().decode('utf-8')
# Save the text locally as adult.data; the with-block closes the file automatically
with open('adult.data', "w", encoding="utf-8") as fw:
    fw.write(adult_data)

References

https://blog.csdn.net/xman4code/article/details/80989601
https://www.jianshu.com/p/cfbdacbeac6e

Step 2: Data processing and modeling

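import pandas as pd
# The file has no header row, so the column names are supplied explicitly.
# skipinitialspace=True strips any space that follows a comma, so that values
# read as "?" rather than " ?" (it has no effect if the file has no such spaces).
df = pd.read_csv('adult.data', header=None, index_col=False, skipinitialspace=True,
                 names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                        'marital-status', 'occupation', 'relationship', 'race', 'gender',
                        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
df.head()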

Output

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race  gender  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country income  
0          2174             0              40  United-States  <=50K  
1             0             0              13  United-States  <=50K  
2             0             0              40  United-States  <=50K  
3             0             0              40  United-States  <=50K  
4             0             0              40           Cuba  <=50K
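One detail worth checking at this point is whitespace. If your copy of adult.data separates fields with a comma followed by a space, read_csv() keeps that space as part of every string value unless told otherwise, and a later comparison against "?" would silently match nothing. The sketch below uses two made-up rows (not real records from the dataset) and hypothetical names `raw`, `plain` and `clean` to show the difference that the `skipinitialspace` parameter makes.

```python
import io
import pandas as pd

# Two illustrative rows in the "comma + space" layout (invented, not real records).
raw = "39, State-gov, Male\n50, ?, Female\n"
cols = ["age", "workclass", "gender"]

# Default parsing keeps the space after each comma inside the string values.
plain = pd.read_csv(io.StringIO(raw), header=None, names=cols)
# skipinitialspace=True drops whitespace that directly follows the delimiter.
clean = pd.read_csv(io.StringIO(raw), header=None, names=cols, skipinitialspace=True)

print(repr(plain.loc[1, "workclass"]))  # leading space kept
print(repr(clean.loc[1, "workclass"]))  # leading space stripped
```

Printing values with repr() is a quick way to spot this kind of invisible whitespace before filtering on exact string matches.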

Select specific variables

df = df[['age','workclass','education','gender','hours-per-week','occupation','income']]

To check the categorical data for anomalies, use the value_counts() function, which shows the unique values and how many times each one occurs

for char in list(df.columns):
    if df[char].dtypes == "object":
        print(df[char].value_counts())

Output

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64
Male      21790
Female    10771
Name: gender, dtype: int64
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64
<=50K    24720
>50K      7841
Name: income, dtype: int64
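The same check can be written without testing each column's dtype by hand. The sketch below is one possible variant, run on a small invented DataFrame (`toy` and `categorical_cols` are names chosen for this illustration): select_dtypes() picks out the non-numeric columns, and value_counts() is then applied to each.

```python
import pandas as pd

# Small invented sample with one numeric and two string columns.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],
    "gender": ["Male", "Male", "Female"],
})

# Keep every column that is not numeric, i.e. the string-valued ones here.
categorical_cols = toy.select_dtypes(exclude="number").columns
for col in categorical_cols:
    print(toy[col].value_counts())
```

Excluding numeric columns, rather than including a specific string dtype, is an assumption made here so the selection does not depend on how a given pandas version stores strings.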

The results show that the workclass and occupation variables contain "?". Next, delete the rows that contain a question mark

df = df[df['occupation'] != "?"]
df = df[df['workclass'] != "?"]

Reference

https://www.cnblogs.com/cocowool/p/8421997.html
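An alternative to filtering each column separately, sketched here on invented data, is to convert the "?" placeholder into a real missing value and let dropna() remove the incomplete rows. This is a different route from the one used in this post, not the book's method; the names `toy` and `cleaned` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented sample in which "?" marks an unknown value.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "occupation": ["Sales", "?", "?", "Tech-support"],
    "age": [25, 40, 31, 52],
})

# Turn the "?" placeholder into NaN, then drop rows missing either field.
cleaned = toy.replace("?", np.nan).dropna(subset=["workclass", "occupation"])
print(len(toy), "->", len(cleaned))
```

One advantage of this form is that the missing values become visible to pandas tools such as isna(), should you later prefer imputing them over dropping the rows.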

Use the get_dummies() function to convert the categorical variables

df_dummies = pd.get_dummies(df)
print("Features after get_dummies: \n", list(df_dummies.columns))

Output

Features after get_dummies:
 ['age', 'hours-per-week', 'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private', 'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc', 'workclass_State-gov', 'workclass_Without-pay', 'education_10th', 'education_11th', 'education_12th', 'education_1st-4th', 'education_5th-6th', 'education_7th-8th', 'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc', 'education_Bachelors', 'education_Doctorate', 'education_HS-grad', 'education_Masters', 'education_Preschool', 'education_Prof-school', 'education_Some-college', 'gender_Female', 'gender_Male', 'occupation_Adm-clerical', 'occupation_Armed-Forces', 'occupation_Craft-repair', 'occupation_Exec-managerial', 'occupation_Farming-fishing', 'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct', 'occupation_Other-service', 'occupation_Priv-house-serv', 'occupation_Prof-specialty', 'occupation_Protective-serv', 'occupation_Sales', 'occupation_Tech-support', 'occupation_Transport-moving', 'income_<=50K', 'income_>50K']
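Notice that the target column income was encoded as well, producing income_<=50K and income_>50K, which then have to be kept out of the features. One way to avoid that, shown here as a sketch on invented data with hypothetical names `X_df` and `y`, is to separate the target before encoding.

```python
import pandas as pd

# Invented sample with one numeric feature, one categorical feature and the target.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "gender": ["Male", "Male", "Female", "Male"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Encode only the predictors; keep the target as a single 0/1 vector.
X_df = pd.get_dummies(toy.drop(columns=["income"]), dtype=int)
y = (toy["income"] == ">50K").astype(int)

print(list(X_df.columns))
print(y.tolist())
```

With this layout every column of `X_df` is a feature, so no column slicing is needed afterwards.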

Next, train a logistic regression classification model

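# Select every column from 'age' through 'occupation_Transport-moving' by label.
# A .loc label slice includes both endpoints, so this keeps all 41 feature
# columns and leaves out the two income_* columns at the end.
# (.loc replaces the .ix indexer used in the book's code.)
features = df_dummies.loc[:,'age':'occupation_Transport-moving']
X = features.values
Y = df_dummies['income_>50K'].values
print("X.shape: {} Y.shape:{}".format(X.shape,Y.shape))
# Output
X.shape: (30718, 41) Y.shape:(30718,)

The book's code used .ix for this selection, which produced the warning below. Since .ix was removed entirely in pandas 1.0, .loc is the replacement for label-based selection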

C:\Users\mingy\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.
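To make the column selection less mysterious, here is a small sketch of how the two replacement indexers behave, using an invented DataFrame whose columns mimic the encoded layout. The key difference is that a label slice with .loc includes both endpoints, while a positional slice with .iloc excludes the stop position.

```python
import pandas as pd

# Invented frame whose column layout mimics the encoded data.
toy = pd.DataFrame(
    [[39, 1, 0, 1, 0], [50, 0, 1, 0, 1]],
    columns=["age", "gender_Female", "gender_Male", "income_<=50K", "income_>50K"],
)

# Label slice: both endpoints are included.
by_label = toy.loc[:, "age":"gender_Male"]
# Positional slice: the stop position is excluded, like ordinary Python slicing.
by_position = toy.iloc[:, 0:3]

print(list(by_label.columns))
print(by_label.equals(by_position))
```

Slicing by label is the safer choice here because it does not depend on knowing how many dummy columns were generated.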

Train the model

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Hold out part of the data for testing; random_state=0 makes the split reproducible
X_train,X_test,y_train,y_test = train_test_split(X,Y,random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
# Output
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
# Accuracy on the held-out test set
print("Test score:{:.2f}".format(logreg.score(X_test,y_test)))
# Output
Test score:0.81
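To put the test score in context, it helps to compare it with the accuracy of always predicting the majority class. The rough calculation below uses the income counts printed by value_counts() earlier in this post. Those counts were taken before the rows containing "?" were removed and are not restricted to the test split, so treat the result as an approximation rather than an exact baseline for this experiment.

```python
# Class counts taken from the value_counts() output for income shown earlier
# (computed before the rows containing "?" were removed).
n_low, n_high = 24720, 7841

# Accuracy of always predicting the majority class "<=50K".
baseline = n_low / (n_low + n_high)
print("Majority-class baseline: {:.2f}".format(baseline))
```

A model is only useful to the extent that it beats this kind of trivial baseline, so the two numbers are best read side by side.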

Originally published: 2019-02-28