# Learning to Solve a Machine Learning Problem Through a Kaggle Example

In Titanic: Machine Learning from Disaster, the task is to predict, from features such as age and sex, which passengers were more likely to survive. This makes it a classification problem.

## 1. Data Exploration

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
```
```python
# Load the competition data (file names as downloaded from Kaggle).
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# A combined frame used later so feature engineering applies to both sets.
full = pd.concat([train, test], ignore_index=True, sort=False)
```

```python
train.tail()
train.describe()
```

## 2. Data Cleaning

```python
train.isnull().sum()
# test.isnull().sum()
```

```
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
```
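Age, Cabin, and Embarked contain missing values, so they need to be filled before modeling. A minimal sketch of this kind of imputation on a toy frame (the fill strategies below are common defaults, not necessarily the exact ones used in this post):

```python
import pandas as pd

# Toy frame with the Titanic columns that have missing values.
df = pd.DataFrame({
    'Age': [22.0, None, 38.0, None],
    'Fare': [7.25, 71.28, None, 8.05],
    'Embarked': ['S', None, 'C', 'S'],
})
# Numeric columns: fill with the median; categorical: fill with the mode.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
print(df.isnull().sum().sum())  # 0
```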

## 3. Feature Engineering

| Variable | Description |
| --- | --- |
| pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
| name | Name |
| sex | Sex |
| age | Age |
| sibsp | Number of Siblings/Spouses Aboard |
| parch | Number of Parents/Children Aboard |
| ticket | Ticket Number |
| fare | Passenger Fare |
| cabin | Cabin |
| embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |

- Name contains tokens such as ['Capt', 'Col', 'Major', 'Dr', 'Officer', 'Rev'] that reflect a passenger's professional standing.
- The [a-zA-Z] part of Cabin may reflect social status.
- The [0-9] part of Cabin may encode where the cabin was located on the ship.
- SibSp (together with Parch) gives the size of a passenger's family group on board.
```python
import re

# Collapse rare professional titles into one 'Officer' category.
title[title.isin(['Capt', 'Col', 'Major', 'Dr', 'Officer', 'Rev'])] = 'Officer'

# Deck letter from Cabin, for the rows where Cabin is present.
deck = full[~full.Cabin.isnull()].Cabin.map(
    lambda x: re.compile("([a-zA-Z]+)").search(x).group())

# Pattern for the cabin number (its position on the ship).
checker = re.compile("([0-9]+)")

# Family-group size: siblings/spouses + parents/children + the passenger.
full['Group_num'] = full.Parch + full.SibSp + 1
```
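The `title` series used above has to be extracted from the Name column first. A self-contained sketch of one way to do that (the regex and the sample names are illustrative, not the author's exact code):

```python
import re
import pandas as pd

# Titanic names have the form "Surname, Title. Given names".
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Behr, Capt. Karl Howell',
])
# Grab the token between the comma and the first period.
title = names.map(lambda x: re.search(r',\s*([^.]+)\.', x).group(1).strip())
# Collapse rare professional titles into one 'Officer' category.
title[title.isin(['Capt', 'Col', 'Major', 'Dr', 'Officer', 'Rev'])] = 'Officer'
print(title.tolist())  # ['Mr', 'Mrs', 'Officer']
```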

```python
# The post does not show which scaler was used; MinMaxScaler is an assumption.
from sklearn.preprocessing import MinMaxScaler

train = pd.get_dummies(train, columns=['Embarked', 'Pclass', 'Title', 'Group_size'])

# Series has no .reshape on current pandas; go through .values instead.
scaler = MinMaxScaler()
full['NorFare'] = pd.Series(
    scaler.fit_transform(full.Fare.values.reshape(-1, 1)).reshape(-1),
    index=full.index)

full.drop(labels=['PassengerId', 'Name', 'Cabin', 'Survived', 'Ticket', 'Fare'],
          axis=1, inplace=True)
```

## 4. Model Building

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score

scoring = make_scorer(accuracy_score, greater_is_better=True)

def get_model(estimator, parameters, X_train, y_train, scoring):
    # Grid-search the parameter grid and return the best fitted estimator.
    model = GridSearchCV(estimator, param_grid=parameters, scoring=scoring)
    model.fit(X_train, y_train)
    return model.best_estimator_
```
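`plot_learning_curve` is called below but never defined in the post. A minimal version can be built on `sklearn.model_selection.learning_curve`; this sketch is one plausible reconstruction of what the author's helper does:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=4):
    # Mean train/validation score at each training-set size.
    sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv)
    plt.figure()
    plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='training score')
    plt.plot(sizes, test_scores.mean(axis=1), 'o-', label='cross-validation score')
    plt.title(title)
    plt.xlabel('training examples')
    plt.ylabel('score')
    plt.legend(loc='best')
```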

- Import the classifier from sklearn and define a KNN model,
- define a suitable parameter grid `parameters`,
- train the KNN with `get_model`,
- predict on the held-out split and compute the `accuracy_score`,
- then plot the learning curve.
```python
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(weights='uniform')
parameters = {'n_neighbors': [3, 4, 5], 'p': [1, 2]}
clf_knn = get_model(KNN, parameters, X_train, y_train, scoring)

print(accuracy_score(y_test, clf_knn.predict(X_test)))
plot_learning_curve(clf_knn, 'KNN', X, y, cv=4);
```

Accuracy on the held-out split for each tuned model:

| Model | Accuracy |
| --- | --- |
| KNN | 0.816143497758 |
| Random Forest | 0.829596412556 |
| Logistic Regression | 0.811659192825 |
| SVC | 0.838565022422 |
| XGBoost | 0.820627802691 |

## 5. Ensemble

```python
from sklearn.ensemble import VotingClassifier

clf_vc = VotingClassifier(
    estimators=[('xgb1', clf_xgb1), ('lg1', clf_lg1), ('svc', clf_svc),
                ('rfc1', clf_rfc1), ('rfc2', clf_rfc2), ('knn', clf_knn)],
    voting='hard', weights=[4, 1, 1, 1, 1, 2])
clf_vc = clf_vc.fit(X_train, y_train)

print(accuracy_score(y_test, clf_vc.predict(X_test)))
plot_learning_curve(clf_vc, 'Ensemble', X, y, cv=4);
```
```
Ensemble, 0.825112107623
```

## 6. Prediction

```python
def submission(model, fname, X):
    # PassengerId is a module-level series captured from the raw test set
    # before the column was dropped.
    ans = pd.DataFrame({'PassengerId': PassengerId,
                        'Survived': model.predict(X)})
    ans.to_csv(fname, index=False)
```
