XGBoost: comparing the predictive power of a random forest and an XGBoost model on Titanic passenger survival
Boosting classifiers belong to the family of ensemble learning models. The basic idea is to combine hundreds or thousands of tree models, each with low classification accuracy on its own, into a single model with high accuracy. Such a model iterates continuously, generating a new tree at each iteration. There are several ways to produce a reasonable tree at each step, for example the gradient tree boosting mentioned in the section on ensemble (classification) models: when generating each tree it applies the idea of gradient descent, taking all previously generated trees as the starting point and moving one step further in the direction that minimizes the given objective function. With reasonable parameter settings, a fair number of trees is usually needed before a satisfactory accuracy is reached, and on large, complex data sets the model may require thousands of iterations. XGBoost (short for eXtreme Gradient Boosting) is a tool built to handle exactly this problem efficiently.
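The iterative idea described above can be sketched in a few lines. The following is a minimal illustration (not the book's code, and using a hypothetical toy regression data set) of gradient boosting with squared loss: each new tree is fit to the residuals of the current ensemble, which are exactly the negative gradient of the squared loss, and the ensemble takes a small step in that direction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D toy regression data, for illustration only.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1     # shrinkage: size of each gradient step
pred = np.zeros_like(y) # current ensemble prediction, starts at zero
trees = []

for _ in range(100):
    # For squared loss, the negative gradient is just the residual.
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                    # new tree fits the residuals
    pred += learning_rate * tree.predict(X)  # step toward smaller loss
    trees.append(tree)

print('training MSE:', np.mean((y - pred) ** 2))
```

With each added tree the training loss shrinks; XGBoost implements this same additive scheme with a regularized objective and far more engineering for speed and scale.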
# Import pandas for data analysis.
import pandas as pd
titanic = pd.read_csv('titanic.txt')
# Select pclass, age and sex as training features.
x = titanic[['pclass', 'age', 'sex']].copy()
y = titanic['survived']
# Fill in missing age values with the mean of the known ages.
x['age'] = x['age'].fillna(x['age'].mean())
# Split the data, randomly sampling 25% as the test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)
# Import DictVectorizer from sklearn.feature_extraction.
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
# Vectorize the features of the raw data.
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
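For readers unseen with DictVectorizer, the following standalone sketch (with hypothetical sample rows, separate from the experiment above) shows what the vectorization step does: string-valued features such as pclass and sex are one-hot encoded into indicator columns, while numeric features such as age pass through as a single column.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows mimicking the features used above.
rows = [{'pclass': '1st', 'age': 29.0, 'sex': 'female'},
        {'pclass': '3rd', 'age': 40.0, 'sex': 'male'}]

demo_vec = DictVectorizer(sparse=False)
matrix = demo_vec.fit_transform(rows)

# One column per (feature=value) pair for strings; 'age' stays numeric.
print(demo_vec.feature_names_)
print(matrix)
```

This is why the code above converts each DataFrame to a list of dicts with `to_dict(orient='records')` before calling the vectorizer.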
# Train a random forest classifier with the default configuration and evaluate it on the test set.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
print('The accuracy of Random Forest Classifier on testing set:',
      rfc.score(x_test, y_test))
The accuracy of Random Forest Classifier on testing set: 0.7781155015197568
# Fit an XGBoost model with the default configuration and evaluate it on the same test set.
from xgboost import XGBClassifier
xgbc = XGBClassifier()
xgbc.fit(x_train, y_train)
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints=None,
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=0, num_parallel_tree=1,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
validate_parameters=False, verbosity=None)
print('The accuracy of eXtreme Gradient Boosting Classifier on testing set:',
xgbc.score(x_test, y_test))
The accuracy of eXtreme Gradient Boosting Classifier on testing set: 0.7750759878419453