The scikit-learn framework provides the ability to search over combinations of parameters. This capability is provided in the GridSearchCV class and can be used to discover the best way to configure a model for top performance. For example, we can define a grid over the number of trees (n_estimators) and tree size (max_depth) as follows:
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
Each parameter combination is then evaluated using 10-fold cross-validation:
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
result = grid_search.fit(X, label_encoded_y)
We can then review the results to determine the best combination, as well as the general trend as each parameter value varies. This is a best practice when applying XGBoost to your own problems. Key parameters to consider tuning include the number and size of the trees (n_estimators and max_depth) and the learning rate (learning_rate).
Below is a complete example of tuning learning_rate on the Pima Indians Onset of Diabetes dataset.
# Tune learning_rate
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# grid search
model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
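Once the search has finished, the tuned model itself is available as best_estimator_ (GridSearchCV refits it on the full dataset by default), so you can make predictions without retraining. Here is a minimal, self-contained sketch of that pattern on synthetic data; it uses scikit-learn's GradientBoostingClassifier as a stand-in so the snippet runs without xgboost installed, but the GridSearchCV usage is identical with XGBClassifier.

```python
# Minimal sketch: grid search, then predict with the refit best model.
# GradientBoostingClassifier stands in for XGBClassifier here (assumption
# for portability); the GridSearchCV pattern is the same for both.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# small synthetic binary-classification dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

param_grid = dict(learning_rate=[0.01, 0.1, 0.2])
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=7)
grid = GridSearchCV(GradientBoostingClassifier(random_state=7),
                    param_grid, scoring="neg_log_loss", cv=kfold)
grid_result = grid.fit(X, y)

print("Best params:", grid_result.best_params_)
# best_estimator_ is already refit on all of X, y and ready to use
preds = grid_result.best_estimator_.predict(X[:5])
print("Predictions for first 5 rows:", preds)
```

The same idea applies to the Pima Indians example above: replace grid_search with grid_result.best_estimator_ wherever you need predictions from the winning configuration.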
That completes the course. Take a moment to look back at how far you have come: don't take it lightly, you have made a lot of progress in a short time. This is only the beginning of your journey with XGBoost in Python. Keep practicing and developing your skills.