首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >机器学习入门数据集--4.泰坦尼克幸存者预测

机器学习入门数据集--4.泰坦尼克幸存者预测

作者头像
birdskyws
发布2019-03-04 10:47:16
5160
发布2019-03-04 10:47:16
举报

kaggle竞赛数据集 泰坦尼克幸存者预测

官网

列名

解释

survival

是否幸存 0 = No, 1 = Yes

pclass

船票等级

sex

性别

Age

年龄

sibsp

船上兄弟姐妹/配偶的个数

parch

船上父母/孩子的个数

ticket

船票号

fare

船票价格

cabin

船舱号码

embarked

登船口 C = Cherbourg, Q = Queenstown, S = Southampton

数据预处理

1. 统计分析pclass船票等级
import pandas as pd #数据分析
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] #用来正常显示中文标签
plt.rcParams['axes.unicode_minus']=False #用来正常显示负号

data_train = pd.read_csv("/Users/wangsen/ai/03/3day_feature/code/a8_titanic/data/train.csv")
print(data_train.columns)

#看看各乘客等级的获救情况
Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
print(Survived_0)

Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
print(Survived_1)
df=pd.DataFrame({u'获救':Survived_1, u'未获救':Survived_0})
print(df)

df.plot(kind='bar', stacked=True)
plt.title(u"各乘客等级的获救情况")
plt.xlabel(u"乘客等级")
plt.ylabel(u"人数")
plt.show()

结果:

   未获救   获救
1   80  136
2   97   87
3  372  119
2. 统计embarked港口分析
Survived_0 = data_train.Embarked[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Embarked[data_train.Survived == 1].value_counts()
df=pd.DataFrame({u'获救':Survived_1, u'未获救':Survived_0})
print(df)

结果:

   未获救   获救
S  427  217
C   75   93
Q   47   30
3.性别sex统计分析
Survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()
Survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()
df=pd.DataFrame({u'男性':Survived_m, u'女性':Survived_f})
print(df)
    女性   男性
0   81  468
1  233  109
4. 处理年龄空值
data_train.info()

查看数据,年龄字段中有

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

将年龄作为预测目标,通过算法进行拟合。

    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    # 乘客分成已知年龄和未知年龄两部分
    known_age = age_df[age_df.Age.notnull()].as_matrix()
    unknown_age = age_df[age_df.Age.isnull()].as_matrix()
    # y即目标年龄
    y = known_age[:, 0]
    # X即特征属性值
    X = known_age[:, 1:]
    # fit到RandomForestRegressor之中
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    # 用得到的模型进行未知年龄结果预测
    predictedAges = rfr.predict(unknown_age[:, 1::])
    # 用得到的预测结果填补原缺失数据
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges
5.建模

最后只选取8个维度 Pclass Age SibSp Parch Sex Cabin Fare Embarked。dummy编码进行维度扩展。

Pclass        891 non-null int64
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Sex_female    891 non-null uint8
Sex_male      891 non-null uint8
Cabin_NO      891 non-null uint8
Cabin_YES     891 non-null uint8
Embarked_C    891 non-null uint8
Embarked_Q    891 non-null uint8
Embarked_S    891 non-null uint8

建模:

import numpy as np
import pandas as pd
test_file = "/Users/wangsen/ai/03/3day_feature/code/a8_titanic/data/test.csv"
train_file = "/Users/wangsen/ai/03/3day_feature/code/a8_titanic/data/train.csv"
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)

train_target = train_df.pop('Survived')
train_df = train_df.drop(['PassengerId','Name','Ticket'],axis=1)
test_df = test_df.drop(['PassengerId','Name','Ticket'],axis=1)
train_len = train_df.shape[0]
test_len = test_df.shape[0]
all_df = pd.concat((train_df,test_df),axis=0)
train_df.loc[train_df.Cabin.isnull()==False,'Cabin'] = 'YES'
train_df.loc[train_df.Cabin.isnull()==True,'Cabin']='NO'
print(train_df['Cabin'].value_counts())
train_df.Age = train_df.Age.fillna(train_df.Age.mean())
train_dummy_df = pd.get_dummies(train_df)
#train_df.info()
#print(train_df.describe())

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(train_dummy_df,train_target)
score = lr.score(train_dummy_df,train_target)
print("简单逻辑回归的准确率:",score)

结果:

简单逻辑回归的准确率: 0.8002244668911336

总结

泰坦尼克数据集比较规整,只需要处理较少量缺失值。用线性逻辑回归模型,可以达到80%的预测准确性。

本文参与 腾讯云自媒体分享计划,分享自作者个人站点/博客。
原始发表:2019.02.12 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • kaggle竞赛数据集 泰坦尼克幸存者预测
  • 数据预处理
    • 1. 统计分析pclass船票等级
      • 2. 统计embarked港口分析
        • 3.性别sex统计分析
          • 4. 处理年龄空值
            • 5.建模
            • 总结
            领券
            问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档