# Learning to Solve a Machine Learning Problem Through a Kaggle Example

In Titanic: Machine Learning from Disaster, the task is to predict which passengers were more likely to survive from features such as age and sex, so this is a classification problem.

## 1. Data Exploration

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
```
```python
# Load the competition's train.csv / test.csv data files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.tail()
train.describe()
```

## 2. Data Cleaning

```python
train.isnull().sum()
# test.isnull().sum()
```

```
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
```
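Age, Fare, and Embarked can be filled before modeling. A minimal sketch of one common imputation scheme (median for numeric columns, most frequent value for categorical ones; the toy frame below stands in for `train`/`full`):

```python
import pandas as pd
import numpy as np

# Tiny stand-in frame; in the notebook this would be train/full
df = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, np.nan, 8.05],
})

# Numeric columns: fill with the median
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Categorical columns: fill with the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

print(df.isnull().sum().sum())  # 0 — no missing values remain
```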

## 3. Feature Engineering

```
pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name        Name
sex         Sex
age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
```

- Name contains titles such as ['Capt', 'Col', 'Major', 'Dr', 'Officer', 'Rev'] that reflect a passenger's professional standing.
- The letters ([a-zA-Z]) in Cabin may reflect social status.
- The digits ([0-9]) in Cabin may encode the cabin's physical location on the ship.
- SibSp (together with Parch) gives the size of a passenger's family group aboard.
```python
import re

# Collapse professional titles into a single 'Officer' category
title[title.isin(['Capt', 'Col', 'Major', 'Dr', 'Officer', 'Rev'])] = 'Officer'

# Deck letter from the non-null Cabin values
deck = full[~full.Cabin.isnull()].Cabin.map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())

# Regex for the cabin number
checker = re.compile("([0-9]+)")

# Family-group size: siblings/spouses + parents/children + the passenger
full['Group_num'] = full.Parch + full.SibSp + 1
```
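The `title` series used above is not defined in the post; a hedged sketch of how it could be extracted from the Name field (the regex and the sample names are illustrative):

```python
import re
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley',
    'Behr, Dr. Karl Howell',
])

# The title is the word between the comma and the period
title = names.map(lambda x: re.search(r',\s*([^.]+)\.', x).group(1))
print(title.tolist())  # ['Mr', 'Mrs', 'Dr']

# Collapse rare professional titles into one 'Officer' bucket
title[title.isin(['Capt', 'Col', 'Major', 'Dr', 'Officer', 'Rev'])] = 'Officer'
print(title.tolist())  # ['Mr', 'Mrs', 'Officer']
```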

```python
from sklearn.preprocessing import MinMaxScaler

# One-hot encode the categorical features
# ('Group_size' does not appear above; presumably a binned version of Group_num)
train = pd.get_dummies(train, columns=['Embarked', 'Pclass', 'Title', 'Group_size'])

# Scale Fare into [0, 1] (assuming scaler is a MinMaxScaler)
scaler = MinMaxScaler()
full['NorFare'] = pd.Series(scaler.fit_transform(full.Fare.values.reshape(-1, 1)).reshape(-1), index=full.index)

# Drop columns that are no longer needed as model inputs
full.drop(labels=['PassengerId', 'Name', 'Cabin', 'Survived', 'Ticket', 'Fare'], axis=1, inplace=True)
```
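The same one-hot/scaling steps on a toy frame, assuming `scaler` is a `MinMaxScaler` (the post never shows its definition):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S'],
                   'Fare': [7.25, 71.28, 8.05, 53.10]})

# One-hot encode the categorical column -> Embarked_C, Embarked_Q, Embarked_S
df = pd.get_dummies(df, columns=['Embarked'])

# Scale Fare into [0, 1]
scaler = MinMaxScaler()
df['NorFare'] = scaler.fit_transform(df[['Fare']]).reshape(-1)
print(df[['Fare', 'NorFare']])
```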

## 4. Model Building

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score

scoring = make_scorer(accuracy_score, greater_is_better=True)

def get_model(estimator, parameters, X_train, y_train, scoring):
    # Grid-search the parameter grid and return the best estimator found
    model = GridSearchCV(estimator, param_grid=parameters, scoring=scoring)
    model.fit(X_train, y_train)
    return model.best_estimator_
```

- After importing the classifier from sklearn, define a KNN model,
- define a suitable parameter grid `parameters`,
- train the KNN model with `get_model`,
- use the trained model to predict on the held-out test split and compute its `accuracy_score`,
- then plot the `learning_curve`.
```python
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(weights='uniform')
parameters = {'n_neighbors': [3, 4, 5], 'p': [1, 2]}
clf_knn = get_model(KNN, parameters, X_train, y_train, scoring)

print(accuracy_score(y_test, clf_knn.predict(X_test)))
plot_learning_curve(clf_knn, 'KNN', X, y, cv=4);
```
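`plot_learning_curve` is a helper the post never shows; a minimal sketch built on sklearn's `learning_curve`, with a signature chosen to match the calls above (an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=None):
    """Plot mean train/validation accuracy against training-set size."""
    sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv)
    plt.figure()
    plt.title(title)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.plot(sizes, test_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.legend(loc='best')
    return plt
```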

| Model | Test accuracy |
| --- | --- |
| KNN | 0.816143497758 |
| Random Forest | 0.829596412556 |
| Logistic Regression | 0.811659192825 |
| SVC | 0.838565022422 |
| XGBoost | 0.820627802691 |

## 5. Ensemble

```python
from sklearn.ensemble import VotingClassifier

# Majority-vote ensemble of the tuned models, weighted toward XGBoost and KNN
clf_vc = VotingClassifier(estimators=[('xgb1', clf_xgb1), ('lg1', clf_lg1), ('svc', clf_svc),
                                      ('rfc1', clf_rfc1), ('rfc2', clf_rfc2), ('knn', clf_knn)],
                          voting='hard', weights=[4, 1, 1, 1, 1, 2])
clf_vc = clf_vc.fit(X_train, y_train)

print(accuracy_score(y_test, clf_vc.predict(X_test)))
plot_learning_curve(clf_vc, 'Ensemble', X, y, cv=4);
```
```
Ensemble, 0.825112107623
```

## 6. Prediction

```python
def submission(model, fname, X):
    # Build the PassengerId/Survived frame Kaggle expects and write it to CSV
    ans = pd.DataFrame(columns=['PassengerId', 'Survived'])
    ans.PassengerId = PassengerId
    ans.Survived = pd.Series(model.predict(X), index=ans.index)
    ans.to_csv(fname, index=False)
```
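A hedged usage sketch of this final step, with a `DummyClassifier` and toy frames standing in for the tuned ensemble and the cleaned feature matrices (all values are illustrative):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Stand-ins for the real objects:
X_train = pd.DataFrame({'Sex': [0, 1, 0, 1], 'Age': [22, 38, 26, 35]})
y_train = pd.Series([0, 1, 1, 1])
X_test = pd.DataFrame({'Sex': [0, 1], 'Age': [30, 28]})
PassengerId = pd.Series([892, 893])

# A trivial model standing in for the tuned ensemble clf_vc
model = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)

# Equivalent to submission(model, 'titanic_submission.csv', X_test)
ans = pd.DataFrame({'PassengerId': PassengerId,
                    'Survived': pd.Series(model.predict(X_test), index=PassengerId.index)})
ans.to_csv('titanic_submission.csv', index=False)
```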
