# Classic Kaggle Data Analysis Project: Titanic Survival Prediction!

Datawhale quality content

1. Open-source project "Hands-on Data Analysis":

https://github.com/datawhalechina/hands-on-data-analysis

2. DCIC 2020 algorithm competition: DCIC is one of the few competitions in China built on real, open government data, offering excellent opportunities for both hands-on practice and academic research.

https://mp.weixin.qq.com/s/-fzQIlZRig0hqSm7GeI_Bw

# 1. Data Overview and Visualization

## 1.1 Data Overview

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# index_col=0 uses the PassengerId column as the index
train_data = pd.read_csv("input/train.csv", index_col=0)
test_data = pd.read_csv("input/test.csv", index_col=0)
```

```python
train_data.describe()
```

```python
# Count missing values per column, most-missing first
train_data.isnull().sum().sort_values(ascending=False).head(4)
```

```
Cabin       687
Age         177
Embarked      2
Fare          0
dtype: int64
```

## 1.2 Data Visualization

### 1.2.1 Sex and Survival Rate

```python
sns.barplot(x="Sex", y="Survived", data=train_data)
plt.show()
```

### 1.2.2 Passenger Class (Socioeconomic Status) and Survival Rate

```python
# Draw a bar plot of survival by Pclass
sns.barplot(x="Pclass", y="Survived", data=train_data)

# Print the percentage of passengers in each Pclass who survived
print("Percentage of Pclass = 1 who survived:", train_data["Survived"][train_data["Pclass"] == 1].value_counts(normalize=True)[1]*100)
print("Percentage of Pclass = 2 who survived:", train_data["Survived"][train_data["Pclass"] == 2].value_counts(normalize=True)[1]*100)
print("Percentage of Pclass = 3 who survived:", train_data["Survived"][train_data["Pclass"] == 3].value_counts(normalize=True)[1]*100)
```
```
Percentage of Pclass = 1 who survived: 62.96296296296296
Percentage of Pclass = 2 who survived: 47.28260869565217
Percentage of Pclass = 3 who survived: 24.236252545824847
```

As predicted, passengers from higher socioeconomic classes had higher survival rates (62.9% vs. 47.3% vs. 24.2%).
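The three `value_counts` calls above can be collapsed into one `groupby`; a quick sanity check on a tiny made-up frame (not the real Kaggle data):

```python
import pandas as pd

# Tiny illustrative sample, not the actual Titanic data
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# One groupby yields the per-class survival percentage directly
rates = toy.groupby("Pclass")["Survived"].mean() * 100
print(rates)
```

On the real training set this reproduces the three percentages printed above in a single line.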

### 1.2.3 Number of Relatives and Survival Rate

```python
# Draw a bar plot for SibSp vs. survival
sns.barplot(x="SibSp", y="Survived", data=train_data)

# I won't be printing individual percent values for all of these.
print("Percentage of SibSp = 0 who survived:", train_data["Survived"][train_data["SibSp"] == 0].value_counts(normalize=True)[1]*100)
print("Percentage of SibSp = 1 who survived:", train_data["Survived"][train_data["SibSp"] == 1].value_counts(normalize=True)[1]*100)
print("Percentage of SibSp = 2 who survived:", train_data["Survived"][train_data["SibSp"] == 2].value_counts(normalize=True)[1]*100)
```
```
Percentage of SibSp = 0 who survived: 34.53947368421053
Percentage of SibSp = 1 who survived: 53.588516746411486
Percentage of SibSp = 2 who survived: 46.42857142857143
```

In general, people with more siblings or spouses aboard were less likely to survive. However, contrary to expectations, people with no siblings or spouses were less likely to survive than those with one or two (34.5% vs. 53.6% vs. 46.4%).

```python
# Draw a bar plot for Parch vs. survival
sns.barplot(x="Parch", y="Survived", data=train_data)
plt.show()
```

People with fewer than four parents or children aboard were more likely to survive than those with four or more. Again, people traveling alone were less likely to survive than those with one to three parents or children.

### 1.2.4 Age and Survival Rate

```python
# Sort the ages into logical categories; -0.5 marks missing ages
train_data["Age"] = train_data["Age"].fillna(-0.5)
test_data["Age"] = test_data["Age"].fillna(-0.5)
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
# One label per bin; "Unknown" catches the -0.5 placeholder
labels = ["Unknown", "Baby", "Child", "Teenager", "Student", "Young Adult", "Adult", "Senior"]
train_data["AgeGroup"] = pd.cut(train_data["Age"], bins, labels=labels)
test_data["AgeGroup"] = pd.cut(test_data["Age"], bins, labels=labels)

# Draw a bar plot of AgeGroup vs. survival
sns.barplot(x="AgeGroup", y="Survived", data=train_data)
plt.show()
```
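`pd.cut` assigns each value to a half-open bin `(left, right]`, so the -0.5 fill value for missing ages lands in its own bin. A quick check on made-up ages (the label names follow the common choice for this bin layout):

```python
import numpy as np
import pandas as pd

bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ["Unknown", "Baby", "Child", "Teenager", "Student", "Young Adult", "Adult", "Senior"]

ages = pd.Series([-0.5, 4, 22, 70])          # -0.5 stands in for a missing age
groups = pd.cut(ages, bins, labels=labels)
print(list(groups))  # ['Unknown', 'Baby', 'Student', 'Senior']
```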

### 1.2.5 Presence of a Cabin Record and Survival Rate

I think the idea here is that people with recorded cabin numbers are of higher socioeconomic class, and thus more likely to survive.

```python
# 1 if a cabin number was recorded, 0 otherwise
train_data["CabinBool"] = train_data["Cabin"].notnull().astype(int)
test_data["CabinBool"] = test_data["Cabin"].notnull().astype(int)

# Calculate percentages of CabinBool vs. survived
print("Percentage of CabinBool = 1 who survived:", train_data["Survived"][train_data["CabinBool"] == 1].value_counts(normalize=True)[1]*100)
print("Percentage of CabinBool = 0 who survived:", train_data["Survived"][train_data["CabinBool"] == 0].value_counts(normalize=True)[1]*100)

# Draw a bar plot of CabinBool vs. survival
sns.barplot(x="CabinBool", y="Survived", data=train_data)
plt.show()
```

```
Percentage of CabinBool = 1 who survived: 66.66666666666666
Percentage of CabinBool = 0 who survived: 29.985443959243085
```
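The `notnull().astype(int)` idiom turns the mere presence of a cabin record into a 0/1 feature; on a tiny made-up frame:

```python
import pandas as pd

# Illustrative sample only, not the real Cabin column
toy = pd.DataFrame({"Cabin": ["C85", None, "E46", None]})
toy["CabinBool"] = toy["Cabin"].notnull().astype(int)
print(toy["CabinBool"].tolist())  # [1, 0, 1, 0]
```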

# 2. Data Preprocessing

## 2.1 Concatenate the Datasets

```python
# Pull the label off the training set, then stack train and test
# so feature engineering is applied to both at once
y_train = train_data.pop("Survived")
data_all = pd.concat((train_data, test_data), axis=0)
```
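`pop` removes the label column in place and returns it, so the concatenated frame holds only features; a minimal sketch on toy frames:

```python
import pandas as pd

# Toy stand-ins for the real train/test frames
train = pd.DataFrame({"Survived": [1, 0], "Age": [22.0, 38.0]}, index=[1, 2])
test = pd.DataFrame({"Age": [26.0]}, index=[3])

y = train.pop("Survived")                 # label removed from train, returned here
data_all = pd.concat((train, test), axis=0)
print(data_all.shape)                     # (3, 1): only the Age feature remains
```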

## 2.2 Process the Name Feature and Extract the Title

```python
# Isolate the title between the comma and the first period in each name
title = pd.DataFrame()
title["Title"] = data_all["Name"].map(lambda name: name.split(",")[1].split(".")[0].strip())

# Map the raw titles onto six broader categories
Title_Dictionary = {
    "Capt":         "Officer",
    "Col":          "Officer",
    "Major":        "Officer",
    "Jonkheer":     "Royalty",
    "Don":          "Royalty",
    "Sir":          "Royalty",
    "Dr":           "Officer",
    "Rev":          "Officer",
    "the Countess": "Royalty",
    "Dona":         "Royalty",
    "Mme":          "Mrs",
    "Mlle":         "Miss",
    "Ms":           "Mrs",
    "Mr":           "Mr",
    "Mrs":          "Mrs",
    "Miss":         "Miss",
    "Master":       "Master",
}
title["Title"] = title.Title.map(Title_Dictionary)

# One-hot encode the grouped titles and drop the raw Name column
title = pd.get_dummies(title.Title)
data_all = pd.concat((data_all, title), axis=1)
data_all.pop("Name")
```
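The lambda above splits on the comma and the first period to isolate the title. Wrapped in a (hypothetical) helper and applied to two sample names:

```python
def extract_title(name):
    # "Braund, Mr. Owen Harris" -> take the piece after the comma,
    # then the piece before the first period, then strip spaces
    return name.split(",")[1].split(".")[0].strip()

print(extract_title("Braund, Mr. Owen Harris"))                              # Mr
print(extract_title("Cumings, Mrs. John Bradley (Florence Briggs Thayer)"))  # Mrs
```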

## 2.3 Extract Other Features

```python
# Keep only the deck letter of each cabin; "NA" marks missing cabins
data_all["Cabin"].fillna("NA", inplace=True)
data_all["Cabin"] = data_all["Cabin"].map(lambda s: s[0])
data_all.pop("Ticket")
```

```python
# Treat Pclass as categorical, then one-hot encode the categorical features
data_all["Pclass"] = data_all["Pclass"].astype(str)
feature_dummies = pd.get_dummies(data_all[["Pclass", "Sex", "Embarked", "Cabin"]])
data_all.drop(["Pclass", "Sex", "Embarked", "Cabin"], inplace=True, axis=1)
data_all = pd.concat((data_all, feature_dummies), axis=1)
```

```python
# Fill the remaining numeric gaps (Age, Fare) with column means
mean_cols = data_all.mean()
data_all = data_all.fillna(mean_cols)
```
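Passing a Series of column means to `fillna` imputes each column with its own mean (NaNs are skipped when the mean is computed); on a toy frame:

```python
import pandas as pd

# Illustrative frame with one gap per column
toy = pd.DataFrame({"Age": [20.0, None, 40.0], "Fare": [10.0, 20.0, None]})
filled = toy.fillna(toy.mean())
print(filled)  # Age gap -> 30.0 (mean of 20, 40); Fare gap -> 15.0 (mean of 10, 20)
```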

## 2.4 Split Back into Train and Test Sets

```python
# The PassengerId index lets us split data_all back apart
train_df = data_all.loc[train_data.index]
test_df = data_all.loc[test_data.index]
print(train_df.shape, test_df.shape)
```

# 3. Model Training

## 3.1 Random Forest

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
```

```python
%matplotlib inline
# Sweep max_depth and track 10-fold cross-validated precision
depth_ = [1, 2, 3, 4, 5, 6, 7, 8]
scores = []
for depth in depth_:
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
    scores.append(np.mean(test_score))
plt.plot(depth_, scores)
```

## 3.2 Gradient Boosting

```python
from sklearn.ensemble import GradientBoostingClassifier

depth_ = [1, 2, 3, 4, 5, 6, 7, 8]
scores = []
for depth in depth_:
    clf = GradientBoostingClassifier(n_estimators=100, max_depth=depth, random_state=0)
    test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
    scores.append(np.mean(test_score))
plt.plot(depth_, scores)
```

## 3.3 Bagging

Bagging combines many small classifiers, each trained on a random subset of the data, and aggregates their final results by majority vote.

```python
from sklearn.ensemble import BaggingClassifier

# Sweep the number of bagged estimators
params = [1, 10, 15, 20, 25, 30, 40]
test_scores = []
for param in params:
    clf = BaggingClassifier(n_estimators=param)
    test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
```

## 3.4 RidgeClassifier

```python
from sklearn.linear_model import RidgeClassifier

# Sweep the regularization strength on a log scale
alphas = np.logspace(-3, 2, 50)
test_scores = []
for alpha in alphas:
    clf = RidgeClassifier(alpha=alpha)
    test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
    test_scores.append(np.mean(test_score))
plt.plot(alphas, test_scores)
```

## 3.5 RidgeClassifier + Bagging

```python
# Bag multiple ridge classifiers and vote on their predictions
ridge = RidgeClassifier(alpha=5)
params = [1, 10, 15, 20, 25, 30, 40]
test_scores = []
for param in params:
    clf = BaggingClassifier(n_estimators=param, base_estimator=ridge)
    test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
```

## 3.6 XGBClassifier

```python
from xgboost import XGBClassifier

# Sweep tree depth for the gradient-boosted trees
params = [1, 2, 3, 4, 5, 6]
test_scores = []
for param in params:
    clf = XGBClassifier(max_depth=param)
    test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
    test_scores.append(np.mean(test_score))
plt.plot(params, test_scores)
```

## 3.7 Neural Network

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Architecture matching the summary below: two hidden layers, dropout,
# a third hidden layer, and a sigmoid output for binary classification
model = Sequential()
model.add(Dense(32, input_dim=train_df.shape[1], kernel_initializer='uniform', activation='relu'))
model.add(Dense(32, kernel_initializer='uniform', activation='relu'))
model.add(Dropout(0.5))  # dropout rate assumed; the original snippet omits it
model.add(Dense(32, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

```python
history = model.fit(np.array(train_df), np.array(y_train), epochs=20, batch_size=50, validation_split=0.2)
```

```
Epoch 20/20
712/712 [==============================] - 0s 43us/step - loss: 0.4831 - accuracy: 0.7978 - val_loss: 0.3633 - val_accuracy: 0.8715
```

```python
model.summary()
```
```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 32)                896
_________________________________________________________________
dense_2 (Dense)              (None, 32)                1056
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0
_________________________________________________________________
dense_3 (Dense)              (None, 32)                1056
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33
=================================================================
Total params: 3,041
Trainable params: 3,041
Non-trainable params: 0
_________________________________________________________________
```

```python
scores = model.evaluate(train_df, y_train, batch_size=32)
print(scores)  # [loss, accuracy]
```
```
891/891 [==============================] - 0s 18us/step
[0.4208374666645872, 0.8316498398780823]
```

# 4. Model Optimization (Tuning)

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import RidgeClassifier

# Five base learners; their predictions become features for a second-stage model
classifier_num = 5
clf = [None for i in range(classifier_num)]
clf[0] = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
clf[1] = GradientBoostingClassifier(n_estimators=100, random_state=0)  # assumed: clf[1] is unset in the source, and this matches the imports
clf[2] = RidgeClassifier(alpha=5)
clf[3] = BaggingClassifier(n_estimators=15, base_estimator=clf[2])
clf[4] = XGBClassifier(max_depth=2)

X_train, X_test, Y_train, Y_test = train_test_split(train_df, y_train, random_state=0)

# Collect each base model's held-out predictions as new features
predictFrame = pd.DataFrame()
for model in clf:
    model.fit(X_train, Y_train)
    predictFrame[str(model)[:13]] = model.predict(X_test)
```

```python
%matplotlib inline
# Tune the depth of the second-stage random forest on the base predictions
depth_ = [1, 2, 3, 4, 5, 6, 7, 8]
scores = []
for depth in depth_:
    clf_ = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    test_score = cross_val_score(clf_, predictFrame, Y_test, cv=10, scoring="precision")
    scores.append(np.mean(test_score))
plt.plot(depth_, scores)
```

```python
# Refit the base models on the full training set, then train the
# second-stage random forest on their predictions
finalFrame = pd.DataFrame()
XFrame = pd.DataFrame()
for model in clf:
    model.fit(train_df, y_train)
    XFrame[str(model)[:13]] = model.predict(train_df)
    finalFrame[str(model)[:13]] = model.predict(test_df)
final_clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
final_clf.fit(XFrame, y_train)
result = final_clf.predict(finalFrame)
```
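The two-stage scheme above (base models' predictions feeding a final random forest) is a form of stacking, and scikit-learn bundles the same idea in `StackingClassifier`. A minimal sketch on synthetic data, not the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for train_df / y_train
X, y = make_classification(n_samples=200, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("ridge", RidgeClassifier(alpha=5)),
    ],
    # Second-stage model, mirroring final_clf above
    final_estimator=RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0),
)
score = cross_val_score(stack, X, y, cv=5).mean()
print(round(score, 3))
```

Unlike the manual version, `StackingClassifier` generates the second-stage training features with internal cross-validation, which avoids fitting the final model on in-sample base predictions.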

