Introductory Machine Learning Datasets -- 4. Titanic Survival Prediction

A Kaggle competition dataset: predicting which Titanic passengers survived. The column definitions below come from the official competition page.

| Column | Description |
| --- | --- |
| survival | Survived: 0 = No, 1 = Yes |
| pclass | Ticket class |
| sex | Sex |
| age | Age |
| sibsp | Number of siblings/spouses aboard |
| parch | Number of parents/children aboard |
| ticket | Ticket number |
| fare | Ticket fare |
| cabin | Cabin number |
| embarked | Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton |

Data preprocessing

1. Survival statistics by Pclass (ticket class)

import pandas as pd  # data analysis
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can also render CJK labels
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly

data_train = pd.read_csv("/Users/wangsen/ai/03/3day_feature/code/a8_titanic/data/train.csv")
print(data_train.columns)

# Survival counts for each ticket class
Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
print(Survived_0)

Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
print(Survived_1)
df = pd.DataFrame({'Not survived': Survived_0, 'Survived': Survived_1})
print(df)

df.plot(kind='bar', stacked=True)
plt.title("Survival by ticket class")
plt.xlabel("Ticket class")
plt.ylabel("Count")
plt.show()

Result:

   Not survived  Survived
1            80       136
2            97        87
3           372       119
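The counts above can also be turned into per-class survival rates with a single `groupby`. A minimal sketch on a hypothetical six-row mini-sample (the real counts come from the full 891-row train.csv):

```python
import pandas as pd

# Hypothetical mini-sample standing in for train.csv
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Mean of the 0/1 Survived column = survival rate per ticket class
rate = df.groupby("Pclass")["Survived"].mean()
print(rate)
```

Applied to the real counts in the table above, this gives roughly 63%, 47% and 24% survival for classes 1, 2 and 3: class clearly matters.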

2. Survival statistics by Embarked (port)

Survived_0 = data_train.Embarked[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Embarked[data_train.Survived == 1].value_counts()
df = pd.DataFrame({'Not survived': Survived_0, 'Survived': Survived_1})
print(df)

Result:

   Not survived  Survived
S           427       217
C            75        93
Q            47        30

3. Survival statistics by Sex

Survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()
Survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()
df = pd.DataFrame({'Female': Survived_f, 'Male': Survived_m})
print(df)

Result (row index is the Survived value, 0 = no, 1 = yes):

   Female  Male
0      81   468
1     233   109

4. Handling missing Age values

data_train.info()

Inspecting the data shows that the Age column has missing values (714 non-null out of 891):

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

Treat Age as a prediction target and fit a regressor on the other numeric features to impute the missing values (note: `as_matrix()` was removed in newer pandas; `.values` is used instead):

from sklearn.ensemble import RandomForestRegressor

age_df = data_train[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
# Split passengers into known-age and unknown-age groups
known_age = age_df[age_df.Age.notnull()].values
unknown_age = age_df[age_df.Age.isnull()].values
# y is the target: Age
y = known_age[:, 0]
# X is the feature matrix
X = known_age[:, 1:]
# Fit a RandomForestRegressor
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(X, y)
# Predict ages for the passengers whose age is unknown
predictedAges = rfr.predict(unknown_age[:, 1:])
# Fill the missing entries with the predictions
data_train.loc[(data_train.Age.isnull()), 'Age'] = predictedAges

5. Modeling

Only 8 features are kept: Pclass, Age, SibSp, Parch, Sex, Cabin, Fare, Embarked. Dummy (one-hot) encoding expands the categorical ones into the columns below:

Pclass        891 non-null int64
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Sex_female    891 non-null uint8
Sex_male      891 non-null uint8
Cabin_NO      891 non-null uint8
Cabin_YES     891 non-null uint8
Embarked_C    891 non-null uint8
Embarked_Q    891 non-null uint8
Embarked_S    891 non-null uint8
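How those dummy columns arise can be sketched on two hypothetical rows: Cabin is first reduced to a YES/NO "has cabin number" flag, then `pd.get_dummies` one-hot encodes every string column:

```python
import pandas as pd

# Two hypothetical rows illustrating the Cabin -> YES/NO mapping
df = pd.DataFrame({
    "Sex": ["male", "female"],
    "Cabin": ["C85", None],
    "Embarked": ["S", "C"],
})
# Replace the sparse cabin numbers with a presence flag
df["Cabin"] = df["Cabin"].notnull().map({True: "YES", False: "NO"})
# One-hot encode all object columns
dummies = pd.get_dummies(df)
print(list(dummies.columns))
```

On the full training set the same two steps produce the Sex_*, Cabin_* and Embarked_* columns listed above.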

Modeling code:

import numpy as np
import pandas as pd
test_file = "/Users/wangsen/ai/03/3day_feature/code/a8_titanic/data/test.csv"
train_file = "/Users/wangsen/ai/03/3day_feature/code/a8_titanic/data/train.csv"
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)

train_target = train_df.pop('Survived')
train_df = train_df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
test_df = test_df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
train_len = train_df.shape[0]
test_len = test_df.shape[0]
# Concatenated frame for preprocessing train and test together (not used below)
all_df = pd.concat((train_df, test_df), axis=0)
# Reduce Cabin to a YES/NO "has cabin number" flag
train_df.loc[train_df.Cabin.isnull() == False, 'Cabin'] = 'YES'
train_df.loc[train_df.Cabin.isnull() == True, 'Cabin'] = 'NO'
print(train_df['Cabin'].value_counts())
# Here missing ages are simply filled with the mean
train_df.Age = train_df.Age.fillna(train_df.Age.mean())
train_dummy_df = pd.get_dummies(train_df)
#train_df.info()
#print(train_df.describe())

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(train_dummy_df, train_target)
score = lr.score(train_dummy_df, train_target)
print("Training accuracy of plain logistic regression:", score)

Result:

Training accuracy of plain logistic regression: 0.8002244668911336
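Note that 0.80 is accuracy on the training set itself, which tends to be optimistic. A quick way to estimate generalization is k-fold cross-validation; a minimal sketch on synthetic stand-in data (the real call would pass train_dummy_df and train_target instead):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for train_dummy_df / train_target
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Mean accuracy over 5 held-out folds instead of training-set accuracy
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```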

Summary

The Titanic dataset is fairly clean and needs only a small amount of missing-value handling. A plain logistic regression model reaches about 80% accuracy (measured on the training set).
