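In this section, you will use a machine learning algorithm, logistic regression, to solve the Titanic prediction problem. Logistic regression is a classification algorithm that predicts the outcome of an event, such as whether a passenger survived the Titanic disaster.
On April 15, 1912, during her maiden voyage, the Titanic sank after striking an iceberg, killing 1,502 of the 2,224 passengers and crew on board. The tragedy shocked the international community. One reason for the high death toll was that there were not enough lifeboats for the passengers and crew. Although luck played a part in who survived, some groups were more likely to survive than others, such as women, children, and upper-class passengers. In this tutorial, we will use logistic regression to predict which passengers were more likely to survive. The programming language used is Python.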
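To make "predicting the outcome of an event" concrete, here is a minimal sketch of the idea behind logistic regression: a weighted sum of the features is passed through the sigmoid function, which turns it into a probability between 0 and 1. The intercept and weight below are hypothetical values chosen only for illustration; a real model learns them from the training data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for illustration only (not fitted to any data).
intercept = -1.0
weight_female = 2.5

# Score = intercept + weight * feature, where the feature is 1 for female, 0 for male.
p_female = sigmoid(intercept + weight_female * 1)
p_male = sigmoid(intercept + weight_female * 0)

print(f"P(survive | female) = {p_female:.3f}")
print(f"P(survive | male)   = {p_male:.3f}")

# A common convention is to predict "survived" when the probability is at least 0.5.
print("female predicted to survive:", p_female >= 0.5)
print("male predicted to survive:", p_male >= 0.5)
```

The 0.5 threshold is a convention rather than a requirement; it can be moved if one kind of error matters more than the other.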
%reset -f
%clear
# In[*]
import pandas as pd
from sklearn import linear_model
from sklearn import preprocessing
import os
os.chdir('D:\\train\\all')
# In[*]
# read the data
df = pd.read_csv("train.csv")
My personal habit is to look at the data frame after every step to verify that the data was loaded and transformed correctly.
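A quick check of this kind might look like the sketch below. It uses a small hand-made data frame (hypothetical rows that mimic a few columns of train.csv) so that it runs without the file; on the real data you would call the same methods on `df`.

```python
import pandas as pd

# Hypothetical toy rows standing in for train.csv; not real passenger records.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
})

print(toy.head())    # first rows: do the values look sensible?
print(toy.shape)     # (number of rows, number of columns)
print(toy.dtypes)    # data type pandas inferred for each column
```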
# drop the columns that are not useful to us
# (axis=1 means we are dropping columns, not rows)
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
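The text does not show how missing values are handled, but scikit-learn's logistic regression does not accept NaN inputs, and the Kaggle version of train.csv is commonly reported to have gaps in Age and Embarked. The sketch below shows one possible approach, which is an assumption and not necessarily the author's: count the missing values per column, then drop the incomplete rows. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows with a few gaps, standing in for the real data frame.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Assumption: simply drop every row that has at least one missing value.
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Dropping rows is the simplest option but discards data; filling Age with a typical value such as the median is a common alternative.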
# initialize label encoder
label_encoder = preprocessing.LabelEncoder()
# convert Sex and Embarked features to numeric
sex_encoded = label_encoder.fit_transform(df["Sex"])
print(sex_encoded)
# 0 = female
# 1 = male
df['Sex'] = sex_encoded
embarked_encoded = label_encoder.fit_transform(df["Embarked"])
print(embarked_encoded)
# 0 = C
# 1 = Q
# 2 = S
df['Embarked'] = embarked_encoded
print(df.head())
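Note that the values of the Sex and Embarked fields have now been replaced with their encoded values.
To make a field categorical, use the Categorical class in Pandas: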
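The mapping from text to numbers is not arbitrary: `LabelEncoder` sorts the distinct values and numbers them from 0, which is why female becomes 0 and male becomes 1. The sketch below checks this on a short hypothetical list and shows how to map the codes back.

```python
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()

# Hypothetical sample values, not taken from train.csv.
codes = encoder.fit_transform(["male", "female", "female", "male"])

print(list(encoder.classes_))   # sorted distinct values; the position is the code
print(list(codes))              # encoded version of the input

# inverse_transform recovers the original strings from the codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Calling `fit_transform` again on another column overwrites `classes_`, so if you want to decode several columns later, one option is to keep a separate encoder for each column.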
# In[*]
# make fields categorical
df["Pclass"] = pd.Categorical(df["Pclass"])
df["Sex"] = pd.Categorical(df["Sex"])
df["Embarked"] = pd.Categorical(df["Embarked"])
df["Survived"] = pd.Categorical(df["Survived"])
print(df.dtypes)  # examine the data type of each feature
Survived    category
Pclass      category
Sex         category
Age          float64
SibSp          int64
Parch          int64
Fare         float64
Embarked    category
dtype: object
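To see what the conversion changes, the sketch below applies `pd.Categorical` to a short hypothetical column of passenger classes. The values stay the same, but pandas now records the set of allowed categories and treats the numbers as labels instead of quantities.

```python
import pandas as pd

# Hypothetical passenger classes, not taken from train.csv.
pclass = pd.Series([3, 1, 3, 2])
pclass_cat = pd.Series(pd.Categorical(pclass))

print(pclass.dtype)                       # integer dtype before conversion
print(pclass_cat.dtype)                   # category after conversion
print(list(pclass_cat.cat.categories))    # the distinct categories, sorted
print(list(pclass_cat.cat.codes))         # position of each value in the category list
```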
# In[*]
# we use all columns except Survived as
# features for training
features = df.drop('Survived', axis=1)  # pass axis by keyword; recent pandas versions reject the positional form
# the label is Survived
label = df['Survived']
from sklearn.model_selection import train_test_split
# split the dataset into train and test sets
train_features, test_features, train_label, test_label = train_test_split(
    features,
    label,
    test_size=0.25,     # hold out 25% of the rows for testing
    random_state=1,     # set the random seed for reproducibility
    stratify=label)     # keep the survived/died ratio the same in both sets
# Training set
print(train_features.head())
print(train_label)
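The `stratify` argument is meant to keep the proportion of each class the same in the training and test sets. A minimal sketch that checks this on hypothetical toy data (20 rows, 40% positive) with the same split settings as above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, of which 8 (40%) are labelled 1.
toy_features = pd.DataFrame({"x": range(20)})
toy_label = pd.Series([1] * 8 + [0] * 12)

X_train, X_test, y_train, y_test = train_test_split(
    toy_features,
    toy_label,
    test_size=0.25,
    random_state=1,
    stratify=toy_label)

print(len(X_train), len(X_test))        # sizes of the two sets
print(y_train.mean(), y_test.mean())    # share of positives in each set
```

Running the same check on `train_label` and `test_label` is a quick way to confirm that the real split is balanced as intended.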