
Case Study: Titanic Survival Prediction (Data Preprocessing)

Author: double
Published 2018-04-02 16:54:50
From the column: 算法channel

01

Background

Previous posts have covered a number of classic machine learning and deep learning algorithms. Understanding how they work lays a good foundation for solving real problems, but how do we pull this scattered knowledge together and work through an actual problem end to end? Kaggle is a great platform for this: it hosts real-world problems, provides the corresponding datasets, has active discussions, and experienced competitors share their analyses and solutions. So let's use a Kaggle project for hands-on practice.

Today we start with the competition that has attracted the most entrants: predicting who survived the sinking of the Titanic. First, the project description:

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Next, the dataset description:

The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

Data Dictionary

Variable — Definition — Key

survival — Survival — 0 = No, 1 = Yes

pclass — Ticket class — 1 = 1st, 2 = 2nd, 3 = 3rd

sex — Sex

age — Age in years

sibsp — # of siblings / spouses aboard the Titanic

parch — # of parents / children aboard the Titanic

ticket — Ticket number

fare — Passenger fare

cabin — Cabin number

embarked — Port of Embarkation — C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower.

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

parch: The dataset defines family relations in this way... Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

02

Data Preprocessing: Feature Engineering

First, use pandas to look at the first 5 rows of the data. The Survived column is the label: 1 means the passenger survived, 0 means they did not.

Call pandas' describe to get overall statistics for the training data:
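The article shows the describe output as a screenshot. A minimal sketch of this check, using a toy frame that mimics the Titanic schema (in practice you would load the real files with pd.read_csv('train.csv') / pd.read_csv('test.csv')):

```python
import pandas as pd

# Toy frame standing in for the real training data
train = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Pclass':   [3, 1, 3, 3, 2],
    'Age':      [22.0, 38.0, None, 35.0, None],
    'Fare':     [7.25, 71.28, 7.92, 8.05, 13.0],
})

stats = train.describe()
# The 'count' row reveals missing values: Age has fewer entries than the others
print(stats.loc['count'])
print('missing Age values:', train['Age'].isnull().sum())
```

On the real training set, the count row shows 891 for most columns but only 714 for Age, which is how the missing values below are detected.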

We can see that the training set has 891 rows, but the Age column has only 714 non-null values, meaning some entries are missing. So first we clean these values (in both the training and test sets): fill each NaN in Age with a random value drawn from the interval [mean - std, mean + std], then convert Age from float64 to int, as shown below:

import numpy as np

full_data = [train, test]
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # Random integers in [mean - std, mean + std) to fill the missing ages
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std,
                                             size=age_null_count)
    # Fill NaNs via .loc (avoids chained assignment) before casting, since
    # non-finite values (NaN or inf) cannot be converted to integer
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)

Feature engineering is extremely important in machine learning: careful analysis of each feature plays a decisive role in the final prediction. So spend enough time studying the features and thinking about how they relate to each other, whether some can be merged into a new feature, whether some can be filtered out, and so on.

That completes the age-related processing. Given this project's setting and common experience ("women and children first", as the movie line goes), it made sense to analyze the important age feature first.

Next, look at the two family-related features: SibSp (siblings/spouses) and Parch (parents/children). From these we derive two features that may be more useful: FamilySize, which merges the two into a single count of family members, and IsAlone, which flags whether the passenger was traveling alone.

for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

Besides age, sex, and family, survival may also relate to a person's social standing, e.g. whether they were an aristocrat or traveled first class. So we extract a Title feature from the Name column:

import re

def get_title(name):
    # Match the word immediately followed by a period, e.g. "Mr." -> "Mr"
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

for dataset in full_data:
    # Group rare titles together and normalize French/abbreviated variants
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir',
                                                 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
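A quick self-contained check of the title regex on a few sample name strings (toy examples in the Titanic name format, not actual rows from the dataset):

```python
import re

def get_title(name):
    # Grab the word immediately followed by a period, e.g. " Mr." -> "Mr"
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

print(get_title("Braund, Mr. Owen Harris"))      # Mr
print(get_title("Futrelle, Mrs. Jacques Heath")) # Mrs
print(get_title("Heikkinen, Miss. Laina"))       # Miss
print(get_title("no title here"))                # empty string
```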

A few miscellaneous features round this out: name length, whether a Cabin is recorded, and filling in the missing Embarked values. These are simple, and you can choose to add them or not.

train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)
# Cabin is NaN (a float) when missing, a string otherwise
train['Has_Cabin'] = train['Cabin'].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test['Cabin'].apply(lambda x: 0 if type(x) == float else 1)
for dataset in full_data:
    # Fill missing Embarked values with the most common port, 'S'
    dataset['Embarked'] = dataset['Embarked'].fillna('S')

03

Data Preprocessing: Data Cleaning

To hand the data over to a machine learning algorithm, every column must be numeric. After the steps in section 02, let's check the data type of each column:
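The article shows the dtype listing as a screenshot. A sketch of the kind of check meant here, on a toy frame with mixed types (illustrative values, not the real data):

```python
import pandas as pd

df = pd.DataFrame({
    'Survived': [0, 1],
    'Sex':      ['male', 'female'],  # object column -> must be encoded
    'Embarked': ['S', 'C'],          # object column -> must be encoded
    'Age':      [22, 38],
})

# Columns with dtype 'object' still need numeric encoding
object_cols = df.dtypes[df.dtypes == 'object'].index.tolist()
print(object_cols)  # ['Sex', 'Embarked']
```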

The non-numeric columns must be converted. Before converting, let's first do some feature selection and remove features we no longer need: PassengerId, Name (already turned into Name_length), Ticket, Cabin (turned into Has_Cabin), and SibSp (folded into FamilySize):

drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis=1)
test = test.drop(drop_elements, axis=1)

After dropping these columns, the data looks like this:

Next, convert the remaining object columns to numeric types:

for dataset in full_data:
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    # Titles outside the mapping become NaN; fill them with 0
    dataset['Title'] = dataset['Title'].fillna(0)
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

Then turn the continuous Fare and Age values into discrete categories (binning):

for dataset in full_data:
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
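The Fare cut points above (7.91, 14.454, 31) look like quartile boundaries of the fare distribution. A hedged sketch of how such equal-frequency bins can be derived with pd.qcut (toy fare values here, so the edges differ from the article's):

```python
import pandas as pd

fares = pd.Series([7.25, 7.92, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33])

# Split into 4 equal-frequency bins; labels=False gives integer codes 0..3,
# retbins=True also returns the bin edges
binned, edges = pd.qcut(fares, 4, labels=False, retbins=True)
print(binned.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
print(edges)            # interior edges play the role of 7.91 / 14.454 / 31
```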

After cleaning, the first 5 rows look like this, with every column now numeric:

With feature extraction and data cleaning done, we move on to visualization.

04

Data Preprocessing: Visualization

Using the seaborn library, plot a Pearson correlation heatmap over all the features above:

import matplotlib.pyplot as plt
import seaborn as sns

colormap = plt.cm.RdBu
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)
plt.show()

This plot lets you compare pairwise correlations between features: the closer a value is to 1 or -1, the stronger the correlation; the closer to 0, the weaker.

Next, draw a multi-variable plot to see how the data is distributed from one feature to another:

That completes the data preprocessing for the Titanic survival prediction task. In the next post, we will feed this data into machine learning algorithms to build a prediction model, see how it performs on the test set, and look at how to optimize it.

Originally published 2017-12-27 via the WeChat public account 程序员郭震zhenguo.
