Handling Categorical Variables in Modeling (Notes, Part 1)

Author: 用户7010445
Published: 2020-03-03 14:25:09

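This post follows Section 1 of Chapter 4, "Representing Data and Engineering Features", in the reference book 《Python机器学习基础教程》 (the Chinese edition of Introduction to Machine Learning with Python).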

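My own rough understanding: a mathematical model is built from mathematical expressions, and those expressions operate on numbers (continuous variables), not on strings (categorical variables). Converting the strings in collected data into numbers falls under what scientists have given the rather fancy name feature engineering.

Take the example data used in this section: the incomes of American adults in 1994. The task for this dataset is to predict whether a worker's income is above or below 50,000 US dollars. The variables in the dataset include: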

  • age
  • workclass
  • education
  • gender
  • hours-per-week
  • occupation
  • income

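Here age and hours-per-week (hours worked per week) are continuous features, while workclass (type of work), education (education level), gender, and occupation are categorical variables.

One way to handle categorical variables is one-hot encoding (also called one-out-of-N encoding, or dummy variables). The idea behind dummy variables is to replace a categorical variable with one or more new features that take only the values 0 and 1, and these two values are meaningful inside a mathematical formula. For example, the dataset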

seq  gender  income  hours-per-week
1    Male    50,000  50
2    Female  60,000  40

becomes the following after conversion:

seq  Male  Female  income  hours-per-week
1    1     0       50,000  50
2    0     1       60,000  40

In Python, one way to perform this conversion is the get_dummies() function in pandas.
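As a minimal sketch, the two-row table above can be run through get_dummies() directly. The DataFrame name `toy` is made up for this illustration, and `dtype=int` is passed only so the new columns print as 0/1. Note that pandas prefixes the new columns with the original column name (gender_Male rather than Male).

```python
import pandas as pd

# Toy table copied from the example above (the name `toy` is illustrative)
toy = pd.DataFrame({
    "seq": [1, 2],
    "gender": ["Male", "Female"],
    "income": [50000, 60000],
    "hours-per-week": [50, 40],
})

# Expand only the gender column; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)
print(encoded)
```

The numeric columns are left untouched, and the gender column is replaced by one indicator column per category.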

Next, we reproduce the example from the book.

Step 1: Download the dataset

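Searching for the keyword adult.data in a search engine leads to the download address http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data. You can copy the contents into a text file, or fetch the file with Python, which makes for a very simple web-scraping exercise.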

  • Python download code
from urllib.request import urlopen

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# Download the raw file and decode the bytes as UTF-8 text
with urlopen(url) as response:
    adult_data = response.read().decode('utf-8')

# Save a local copy; the with-block closes the file even if writing fails
with open('adult.data', "w", encoding="utf-8") as fw:
    fw.write(adult_data)
References
  • https://blog.csdn.net/xman4code/article/details/80989601
  • https://www.jianshu.com/p/cfbdacbeac6e
Step 2: Data processing and modeling
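import pandas as pd

# The raw file has no header row, so the column names are supplied here.
# skipinitialspace=True strips the space that follows each comma in the raw
# file, so values read as "Private" or "?" rather than " Private" or " ?".
df = pd.read_csv(
    'adult.data', header=None, index_col=False, skipinitialspace=True,
    names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])
df.head()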
Output
   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race  gender  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country income  
0          2174             0              40  United-States  <=50K  
1             0             0              13  United-States  <=50K  
2             0             0              40  United-States  <=50K  
3             0             0              40  United-States  <=50K  
4             0             0              40           Cuba  <=50K
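One detail worth checking when loading this file: the UCI copy of adult.data separates fields with a comma followed by a space, and read_csv's skipinitialspace option decides whether that space ends up inside the string values. The sketch below parses the first two rows shown above from an in-memory string (the names `raw`, `kept`, and `stripped` are made up for this illustration) so the difference can be seen without downloading anything.

```python
import io
import pandas as pd

# First two rows of the table above, in the raw ", "-separated layout
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
         'marital-status', 'occupation', 'relationship', 'race', 'gender',
         'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
         'income']

# Default parsing keeps the space after each comma inside string fields
kept = pd.read_csv(io.StringIO(raw), header=None, names=names)

# skipinitialspace=True drops that leading space
stripped = pd.read_csv(io.StringIO(raw), header=None, names=names,
                       skipinitialspace=True)

print(repr(kept.loc[0, "workclass"]))
print(repr(stripped.loc[0, "workclass"]))
```

This matters later because a comparison such as `df['workclass'] != "?"` only matches when the stored value carries no leading space.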

Select specific variables

df = df[['age','workclass','education','gender','hours-per-week','occupation','income']]

To check the categorical data for anomalies, use the value_counts() function, which shows each unique value and how often it occurs.

# Print the value counts of every string (object-dtype) column
for col in df.columns:
    if df[col].dtype == "object":
        print(df[col].value_counts())

Output

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64
Male      21790
Female    10771
Name: gender, dtype: int64
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64
<=50K    24720
>50K      7841
Name: income, dtype: int64

The output shows that the workclass and occupation variables contain "?" entries. Next, delete the rows that contain a question mark.

df = df[df['occupation'] != "?"]
df = df[df['workclass'] != "?"]
References
  • https://www.cnblogs.com/cocowool/p/8421997.html
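The two filters above can also be written as a single boolean mask over both columns. This is only an alternative sketch on a made-up miniature table (`df_small` and its rows are hypothetical, not taken from the dataset), assuming the placeholder is stored as exactly "?" with no surrounding whitespace.

```python
import pandas as pd

# Hypothetical miniature of the selected columns, with "?" marking unknowns
df_small = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "?", "?", "Handlers-cleaners"],
})

# True for every row where at least one of the two columns equals "?"
cols = ["workclass", "occupation"]
has_unknown = df_small[cols].eq("?").any(axis=1)

# Keep only the rows without a placeholder
cleaned = df_small[~has_unknown]
print(cleaned)
```

Building the mask once makes it easy to count the affected rows with `has_unknown.sum()` before dropping them.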

Use the get_dummies() function to convert the categorical variables.

df_dummies = pd.get_dummies(df)
print("Features after get_dummies: \n", list(df_dummies.columns))

Output

Features after get_dummies:
 ['age', 'hours-per-week', 'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private', 'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc', 'workclass_State-gov', 'workclass_Without-pay', 'education_10th', 'education_11th', 'education_12th', 'education_1st-4th', 'education_5th-6th', 'education_7th-8th', 'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc', 'education_Bachelors', 'education_Doctorate', 'education_HS-grad', 'education_Masters', 'education_Preschool', 'education_Prof-school', 'education_Some-college', 'gender_Female', 'gender_Male', 'occupation_Adm-clerical', 'occupation_Armed-Forces', 'occupation_Craft-repair', 'occupation_Exec-managerial', 'occupation_Farming-fishing', 'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct', 'occupation_Other-service', 'occupation_Priv-house-serv', 'occupation_Prof-specialty', 'occupation_Protective-serv', 'occupation_Sales', 'occupation_Tech-support', 'occupation_Transport-moving', 'income_<=50K', 'income_>50K']
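The column list above follows a pattern: numeric columns pass through unchanged, and each string column becomes one indicator column per category, including the target income. The sketch below shows this on a made-up three-row sample (`sample` and its rows are hypothetical) and checks that the two income indicators are complementary, which is why a single one of them is enough as the target.

```python
import pandas as pd

# Hypothetical three-row sample with one numeric and two string columns
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Without a columns= argument, every string column is encoded
dummies = pd.get_dummies(sample, dtype=int)
print(list(dummies.columns))

# Each row has exactly one of the two income indicators set
row_sums = dummies["income_<=50K"] + dummies["income_>50K"]
print(row_sums.tolist())
```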
Next, train a logistic regression classifier
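# Label-based slice, inclusive on both ends: keeps every column from 'age' through
# 'occupation_Transport-moving', i.e. everything except the two income columns.
# .loc is the replacement for the deprecated .ix indexer.
features = df_dummies.loc[:, 'age':'occupation_Transport-moving']
X = features.values
Y = df_dummies['income_>50K'].values
print("X.shape: {} Y.shape:{}".format(X.shape, Y.shape))
# Output
X.shape: (30718, 41) Y.shape:(30718,)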
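The feature selection above slices columns by label. As a small sketch of how label-based and position-based slicing differ, the example below uses a made-up one-row frame (`frame` and its values are hypothetical) whose columns mimic the get_dummies() output: a label slice with .loc includes its end label, while a positional slice with .iloc excludes its end index.

```python
import pandas as pd

# Hypothetical one-row frame shaped like a tiny get_dummies() result
frame = pd.DataFrame(
    [[39, 40, 0, 1, 1, 0]],
    columns=["age", "hours-per-week", "gender_Female", "gender_Male",
             "income_<=50K", "income_>50K"],
)

# Label slice: both 'age' and 'gender_Male' are included
by_label = frame.loc[:, "age":"gender_Male"]

# Positional slice: columns 0, 1, 2, 3 (the end index 4 is excluded)
by_position = frame.iloc[:, 0:4]

print(list(by_label.columns))
print(by_label.equals(by_position))
```

Slicing by label keeps working if columns are added before 'age', whereas the positional version would need its indices updated.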

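Calling .ix produces the following warning (.ix was later removed in pandas 1.0; .loc is the label-based replacement):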

C:\Users\mingy\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.

Train the model

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
# Output
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
print("Test score:{:.2f}".format(logreg.score(X_test,y_test)))
# Output
Test score:0.81
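To put the test score in context, it helps to compare it with the accuracy of always predicting the more frequent class. The sketch below is only a rough check: it uses the income counts from the value_counts() output earlier in this post, which were taken before the "?" rows were dropped, so the class balance of the data actually used for training may differ slightly.

```python
# Class counts copied from the income value_counts() output in this post
# (taken before the "?" rows were removed, so the result is approximate)
n_low = 24720   # income <=50K
n_high = 7841   # income >50K

# Accuracy of a model that always predicts the majority class
baseline_accuracy = max(n_low, n_high) / (n_low + n_high)
print("Majority-class baseline: {:.2f}".format(baseline_accuracy))
```

A test score of 0.81 is therefore a modest improvement over this baseline of roughly 0.76.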
Originally published 2019-02-28 on the WeChat official account 小明的数据分析笔记本.
