XGBoost: comparing the predictive power of a random forest and an XGBoost model on Titanic passenger survival
Boosting classifiers belong to the family of ensemble learning models. The basic idea is to combine hundreds or thousands of tree models, each with low classification accuracy on its own, into a single model with high accuracy. Such a model iterates continuously, generating a new tree at each iteration. There are several ways to produce a reasonable tree at each step, for example the gradient tree boosting mentioned in the section on ensemble (classification) models: when generating each tree it applies the idea of gradient descent, taking all previously generated trees as the starting point and moving one step further in the direction that minimizes the given objective function. With reasonable parameter settings, a fair number of trees is usually needed before a satisfactory accuracy is reached, and on large, complex data sets the model may require thousands of iterations. XGBoost (short for eXtreme Gradient Boosting) is a tool built to handle exactly this problem efficiently.
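The iterative idea described above can be sketched in a few lines. The following is a minimal illustration (not the book's code, and using a hypothetical toy regression data set) of gradient boosting with squared loss: each new tree is fit to the residuals of the current ensemble, which are exactly the negative gradient of the squared loss, and the ensemble takes a small step in that direction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D toy regression data, for illustration only.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1     # shrinkage: size of each gradient step
pred = np.zeros_like(y) # current ensemble prediction, starts at zero
trees = []

for _ in range(100):
    # For squared loss, the negative gradient is just the residual.
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                    # new tree fits the residuals
    pred += learning_rate * tree.predict(X)  # step toward smaller loss
    trees.append(tree)

print('training MSE:', np.mean((y - pred) ** 2))
```

With each added tree the training loss shrinks; XGBoost implements this same additive scheme with a regularized objective and far more engineering for speed and scale.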
# Import pandas for data analysis.
import pandas as pd
titanic = pd.read_csv('titanic.txt')
# Select pclass, age and sex as training features.
x = titanic[['pclass', 'age', 'sex']].copy()
y = titanic['survived']
# Fill in missing age values with the mean of the known ages.
x['age'] = x['age'].fillna(x['age'].mean())
# Split the data, randomly sampling 25% as the test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)
# Import DictVectorizer from sklearn.feature_extraction.
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
# Vectorize the features of the raw data.
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
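For readers unseen with DictVectorizer, the following standalone sketch (with hypothetical sample rows, separate from the experiment above) shows what the vectorization step does: string-valued features such as pclass and sex are one-hot encoded into indicator columns, while numeric features such as age pass through as a single column.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows mimicking the features used above.
rows = [{'pclass': '1st', 'age': 29.0, 'sex': 'female'},
        {'pclass': '3rd', 'age': 40.0, 'sex': 'male'}]

demo_vec = DictVectorizer(sparse=False)
matrix = demo_vec.fit_transform(rows)

# One column per (feature=value) pair for strings; 'age' stays numeric.
print(demo_vec.feature_names_)
print(matrix)
```

This is why the code above converts each DataFrame to a list of dicts with `to_dict(orient='records')` before calling the vectorizer.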
# Train a random forest classifier with the default configuration and evaluate it on the test set.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
print('The accuracy of Random Forest Classifier on testing set:',
      rfc.score(x_test, y_test))
The accuracy of Random Forest Classifier on testing set: 0.7781155015197568
# Fit an XGBoost model with the default configuration and evaluate it on the same test set.
from xgboost import XGBClassifier
xgbc = XGBClassifier()
xgbc.fit(x_train, y_train)
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints=None,
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=0, num_parallel_tree=1,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
validate_parameters=False, verbosity=None)
print('The accuracy of eXtreme Gradient Boosting Classifier on testing set:',
xgbc.score(x_test, y_test))
The accuracy of eXtreme Gradient Boosting Classifier on testing set: 0.7750759878419453