Ensemble learning is one of the most popular machine learning approaches today. It is not a single algorithm in itself; instead, it builds and combines multiple learners to complete a learning task, an idea often summed up as "drawing on the strengths of many". Ensembles can be used for classification, regression, feature selection, anomaly detection, and more; they show up in virtually every area of machine learning. This article summarizes how ensemble learning works.
In the previous sections we focused on the individual learners and only briefly mentioned how to combine them. This section summarizes the combination strategies used in ensemble learning. We assume we have obtained T weak learners h_1, h_2, ..., h_T.
Averaging
For regression problems with numeric outputs, the usual combination strategy is averaging: the outputs of the weak learners are averaged to produce the final prediction.

The simplest case is the arithmetic mean, in which the final prediction is

$$H(x) = \frac{1}{T}\sum_{i=1}^{T} h_i(x)$$

If each learner is given its own weight, we use a weighted average instead, and the final prediction is

$$H(x) = \sum_{i=1}^{T} w_i\, h_i(x)$$

where $w_i$ is the weight of the individual learner $h_i$, usually with $w_i \ge 0$ and $\sum_{i=1}^{T} w_i = 1$.
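As a minimal illustration (the predictions and weights below are made up for demonstration, not taken from the dataset used later), simple and weighted averaging can be written as:

import numpy as np

# Predictions of T = 3 weak regressors for the same three samples (illustrative values)
preds = np.array([[2.9, 3.1, 3.0],    # h_1
                  [3.2, 2.8, 3.1],    # h_2
                  [2.7, 3.3, 2.9]])   # h_3

simple_avg = preds.mean(axis = 0)      # H(x) = (1/T) * sum_i h_i(x)

weights = np.array([0.5, 0.3, 0.2])    # w_i >= 0, sum of w_i = 1
weighted_avg = weights @ preds         # H(x) = sum_i w_i * h_i(x)

print(simple_avg, weighted_avg)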
Voting

For classification, the usual combination strategy is voting. Suppose the set of possible classes is $\{c_1, c_2, \dots, c_K\}$. For any sample $x$, the predictions of the T weak learners are $h_1(x), h_2(x), \dots, h_T(x)$.
The simplest scheme is plurality (relative majority) voting, i.e., "the minority yields to the majority": among the T weak learners' predictions for sample $x$, the class $c_i$ that receives the most votes becomes the final prediction. If more than one class ties for the most votes, one of them is chosen at random.
A slightly stricter scheme is absolute majority voting, i.e., requiring more than half of the votes. On top of plurality voting, the winning class must not only receive the most votes but also more than half of them; otherwise the ensemble refuses to make a prediction.
More elaborate still is weighted voting: as with weighted averaging, each weak learner's vote is multiplied by a weight, the weighted votes for each class are summed, and the class with the largest total becomes the final prediction.
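A minimal sketch of plurality and weighted voting over a single sample (the labels and weights below are illustrative):

from collections import Counter

# Class labels predicted by T = 4 weak classifiers for one sample (illustrative)
votes = ['A', 'B', 'A', 'C']
weights = [0.4, 0.3, 0.2, 0.1]   # one weight per learner

# Plurality (relative majority) voting: the most common label wins
plurality = Counter(votes).most_common(1)[0][0]

# Weighted voting: sum the weights per class, pick the class with the largest total
weighted_scores = {}
for label, w in zip(votes, weights):
    weighted_scores[label] = weighted_scores.get(label, 0.0) + w
weighted = max(weighted_scores, key = weighted_scores.get)

print(plurality, weighted)   # both are 'A' here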
Combining by learning (stacking)

The two approaches above simply average or vote over the weak learners' outputs. They are straightforward, but the resulting error can still be fairly large, which motivates a third family of strategies: combining by learning. The representative method is stacking. With stacking we do not apply simple logic to the weak learners' outputs; instead we add another layer of learning: the weak learners' predictions on the training set become the inputs, the training-set labels remain the outputs, and a new learner is trained on this data to produce the final result.
In this setting the weak learners are called primary (level-0) learners, and the learner used to combine them is called the secondary (level-1) learner. For the test set, we first predict with the primary learners to obtain the secondary learner's input samples, and then predict with the secondary learner to obtain the final predictions.
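A minimal stacking sketch using scikit-learn's StackingClassifier; the base and meta learners and the synthetic data are illustrative choices, not the models used later in this article:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples = 1000, n_features = 20, random_state = 42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state = 42)

# Level-0 (primary) learners
estimators = [('rf', RandomForestClassifier(n_estimators = 100, random_state = 42)),
              ('dt', DecisionTreeClassifier(max_depth = 5, random_state = 42))]

# Level-1 (secondary) learner trained on the cross-validated predictions of the level-0 learners
stack = StackingClassifier(estimators = estimators,
                           final_estimator = LogisticRegression(),
                           cv = 5)
stack.fit(X_tr, y_tr)
print('stacking accuracy:', stack.score(X_te, y_te))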
Hill climbing

The hill climbing algorithm is a common approach to optimization problems. It is named by analogy with climbing a mountain: by repeatedly moving "uphill", the search gradually approaches a peak, i.e., a (hopefully) optimal solution.
The basic idea is to start from an initial solution and repeatedly make small adjustments to the current solution, improving it a little at a time, until some stopping condition is met. Note that the solution found this way is not guaranteed to be globally optimal; in general it is only a local optimum.
The general flow of hill climbing is: start from an initial solution, repeatedly evaluate small perturbations (neighbors) of the current solution, move to a neighbor whenever it improves the objective, and stop when no further improvement is found. A minimal sketch of this loop is shown below.
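The following sketch maximizes a toy one-dimensional objective; the objective, step size, and iteration budget are illustrative assumptions:

import random

def hill_climb(objective, x0, step = 0.1, max_iter = 1000):
    """Greedy hill climbing: keep one current solution and move to a
    random neighbor only if it improves the objective."""
    x, fx = x0, objective(x0)
    for _ in range(max_iter):
        candidate = x + random.uniform(-step, step)   # small perturbation of the current solution
        fc = objective(candidate)
        if fc > fx:                                   # accept only improvements
            x, fx = candidate, fc
    return x, fx

# Toy objective with a single peak at x = 2
best_x, best_f = hill_climb(lambda x: -(x - 2) ** 2, x0 = 0.0)
print(best_x, best_f)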
Hill climbing has an obvious weakness when searching for a global optimum: it easily gets stuck in a local optimum, so in practice some strategy is usually needed to avoid this.
To address the risk of getting stuck in a local optimum, the following improvements are commonly used:
Random-restart hill climbing runs the hill climbing procedure multiple times, each time from a different random initial solution, and keeps the best result in the hope of getting closer to the global optimum; a minimal sketch follows these three strategies.
Simulated annealing applies the idea of physical "annealing" to the search: it randomly perturbs the solution and occasionally accepts worse solutions, which lets the search jump out of local optima.
Genetic algorithms mimic natural selection: new generations of candidate solutions are produced and then filtered, driving the search toward a global optimum.
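As an example, random-restart hill climbing simply repeats the greedy loop from several random starting points and keeps the best result. A self-contained sketch (the objective and restart count are illustrative assumptions):

import math
import random

def hill_climb(objective, x0, step = 0.1, max_iter = 1000):
    # Same greedy loop as the sketch above
    x, fx = x0, objective(x0)
    for _ in range(max_iter):
        cand = x + random.uniform(-step, step)
        fc = objective(cand)
        if fc > fx:
            x, fx = cand, fc
    return x, fx

def random_restart_hill_climb(objective, n_restarts = 10, low = -10.0, high = 10.0):
    """Run hill climbing from several random starting points and keep the best result."""
    best_x, best_f = None, float('-inf')
    for _ in range(n_restarts):
        x, fx = hill_climb(objective, random.uniform(low, high))
        if fx > best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# A bumpy toy objective with several local maxima
bumpy = lambda x: math.sin(3 * x) - 0.1 * (x - 2) ** 2
print(random_restart_hill_climb(bumpy))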
With the combination strategies covered, let's move to a hands-on example. First we read the data (since this article focuses on ensembling strategies, the particular dataset is not important and any dataset will do; if you want the one used here, you can get it for free by contacting 云朵君 via the WeChat public account 数据STUDIO).
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(f"train:{train.shape}, test:{test.shape}")
train:(101763, 23), test:(67842, 22)
train.describe()
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101763 entries, 0 to 101762
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 101763 non-null int64
1 loc 101763 non-null float64
2 v(g) 101763 non-null float64
3 ev(g) 101763 non-null float64
4 iv(g) 101763 non-null float64
5 n 101763 non-null float64
6 v 101763 non-null float64
7 l 101763 non-null float64
8 d 101763 non-null float64
9 i 101763 non-null float64
10 e 101763 non-null float64
11 b 101763 non-null float64
12 t 101763 non-null float64
13 lOCode 101763 non-null int64
14 lOComment 101763 non-null int64
15 lOBlank 101763 non-null int64
16 locCodeAndComment 101763 non-null int64
17 uniq_Op 101763 non-null float64
18 uniq_Opnd 101763 non-null float64
19 total_Op 101763 non-null float64
20 total_Opnd 101763 non-null float64
21 branchCount 101763 non-null float64
22 defects 101763 non-null bool
dtypes: bool(1), float64(17), int64(5)
memory usage: 17.2 MB
"训练"和 "测试"数据集中都没有缺失值。由于这是一个合成数据集,我们将检查是否有重复数据。
print(f"如果我们从训练数据集中移除 id,维度为 {train.drop(columns = ['id'], axis = 1).drop_duplicates().shape}")
print(f"如果从训练数据集中删除 id 和缺陷,维度为 {train.drop(columns = ['id', 'defects'], axis = 1).drop_duplicates().shape}")
print(f"如果从训练数据集中删除 id、分支计数和缺陷,维度为 {train.drop(columns = ['id', 'branchCount', 'defects'], axis = 1).drop_duplicates().shape}")
After dropping id from the train set, the de-duplicated shape is (101763, 22)
After dropping id and defects, the de-duplicated shape is (101763, 21)
After dropping id, branchCount and defects, the de-duplicated shape is (101685, 20)
From the above, the train set contains 78 near-duplicate observations (101763 - 101685). Next, we look at the distribution of the target.
import matplotlib.pyplot as plt

train['defects'].value_counts(normalize = True).plot(kind = 'bar', color = ['steelblue', 'orange'])
plt.ylabel('Percentage');
The bar chart shows that the data is imbalanced (~77% False vs. ~23% True). Next, we explore potential correlations between the input features.
import numpy as np
import seaborn as sns

corr_mat = train.drop(columns = ['id', 'defects'], axis = 1).corr()
data_mask = np.triu(np.ones_like(corr_mat, dtype = bool))
cmap = sns.diverging_palette(100, 7, s = 75, l = 40, n = 20, center = 'light', as_cmap = True)

f, ax = plt.subplots(figsize = (18, 13), dpi = 150)
sns.heatmap(corr_mat, annot = True, cmap = cmap, fmt = '.2f', center = 0,
            annot_kws = {'size': 12}, mask = data_mask).set_title('Correlation between input features');
A few observations: branchCount and v(g) are 97% correlated; total_Opnd and total_Op are 96% correlated; total_Op and n are 96% correlated; and l is the only feature that is negatively correlated with the others.

Given the correlation heatmap above, we explore dimensionality reduction via PCA. Note that all input features are right-skewed, so we apply a log transform to the features before running PCA.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.decomposition import PCA

colnames = train.drop(columns = ['id', 'defects'], axis = 1).columns.tolist()

pca_md = Pipeline([('log-tran', ColumnTransformer([('log', FunctionTransformer(np.log1p), colnames)])),
                   ('stand', StandardScaler()),
                   ('pca', PCA())]).fit(train[colnames])
pca_md
plt.figure(figsize = (10, 8), dpi = 150)
ax = sns.lineplot(x = [i for i in range(1, 22)], y = np.cumsum(pca_md['pca'].explained_variance_ratio_), color = 'steelblue', markers = True);
ax.set_xlabel('Number of components')
ax.set_ylabel('Cumulative explained variance')
ax.set_xticks([i for i in range(1, 22)]);
The first 10 components explain more than 99% of the variance in the data. Next, let's visualize these components and look for patterns we might exploit.
pca_10 = Pipeline([('log-tran', ColumnTransformer([('log', FunctionTransformer(np.log1p), colnames)])),
                   ('stand', StandardScaler()),
                   ('pca', PCA(n_components = 10))]).fit_transform(train[colnames])
pca_10 = pd.DataFrame(pca_10, columns = [f'PCA_{i}' for i in range(1, 11)])
pca_10['defects'] = train['defects'].map({False: 0, True: 1})

sns.pairplot(data = pca_10, hue = 'defects', corner = True)
从上述 "PCA" 配对图中,我们可以观察到以下几点:
PCA_1
和 PCA_2
图,在左下角,有一些蓝色样本("defects = True"),而该区域的大多数样本都是红色的("defects = False")。接下来,我们运行 "k-means",看看是否有有趣的模式可以利用。
from tqdm import tqdm
from sklearn.cluster import KMeans

inertias = list()

for i in tqdm(range(2, 21)):
    kmeans_md = Pipeline([('log-tran', ColumnTransformer([('log', FunctionTransformer(np.log1p), colnames)])),
                          ('stand', StandardScaler()),
                          ('kmeans', KMeans(n_clusters = i, n_init = 20, random_state = 42))]).fit(train[colnames])
    inertias.append(kmeans_md['kmeans'].inertia_)
plt.figure(figsize = (10, 8), dpi = 150)
ax = sns.lineplot(x = [i for i in range(2, 21)], y = inertias, color = 'steelblue')
plt.xlabel('Number of Clusters')
plt.ylabel('Cluster Inertia')
# add_name(ax)  # watermark helper used in the original article; not defined here
Based on the elbow in the inertia curve above, 5 clusters looks like a reasonable choice for this dataset. Next, we examine the proportion of defects in each cluster.
kmeans = Pipeline([('log-tran', ColumnTransformer([('log', FunctionTransformer(np.log1p), colnames)])),
                   ('stand', StandardScaler()),
                   ('kmeans', KMeans(n_clusters = 5, n_init = 20, random_state = 42))]).fit(train[colnames])

train['cluster'] = kmeans['kmeans'].labels_

print('Proportion of defects in each cluster:\n')
print(train.groupby('cluster')['defects'].mean())
Proportion of defects in each cluster:
cluster
0 0.109429
1 0.471916
2 0.641147
3 0.077083
4 0.301682
Name: defects, dtype: float64
A few observations: clusters 2 and 1 have noticeably higher defect rates (about 64% and 47%) than clusters 3 and 0 (about 8% and 11%), so the cluster assignment does carry some signal about defects. Next, we briefly look at a few bivariate relationships, shown below.
fig, axes = plt.subplots(1, 2, figsize = (20,8), dpi=150)
ax1 = sns.scatterplot(ax = axes[0], data = train, x = 'uniq_Op', y = 'uniq_Opnd', hue = 'defects');
ax2 = sns.scatterplot(ax = axes[1], data = train, x = 'total_Op', y = 'total_Opnd', hue = 'defects');
The scatter plots above show how defective and non-defective samples are distributed across these operator and operand counts. Next, we count the number of unique values in each input feature.
train.drop(columns = ['id', 'defects'], axis = 1).nunique()
loc 378
v(g) 106
ev(g) 71
iv(g) 84
n 836
v 4515
l 55
d 3360
i 5171
e 8729
b 315
t 8608
lOCode 298
lOComment 91
lOBlank 94
locCodeAndComment 29
uniq_Op 70
uniq_Opnd 176
total_Op 623
total_Opnd 485
branchCount 144
cluster 5
dtype: int64
可以看出,"locCodeAndComment" 是唯一值最少的特征;它只有 29 个唯一值。
I used the following six models to build an ensemble, and the results were impressive.
First, we build a few baseline models without any feature engineering or hyperparameter optimization (HPO). We begin by defining the input features and the target.
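The code that defines these is not shown in the original; a plausible reconstruction, assuming all columns except id, defects, and the exploratory cluster label are used as inputs, would be:

# Assumed feature/target definition (not shown in the original article)
feature_cols = [c for c in train.columns if c not in ['id', 'defects', 'cluster']]

X = train[feature_cols]
Y = train['defects'].astype(int)   # target: 1 = defective, 0 = not defective
test_cv = test[feature_cols]       # test-set features used for the test predictions

Next, we define a hill_climbing helper that implements the hill-climbing blending strategy described above.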
from sklearn.metrics import roc_auc_score

def hill_climbing(x, y, x_test):

    # Evaluating oof predictions
    scores = {}
    for col in x.columns:
        scores[col] = roc_auc_score(y, x[col])

    # Sorting the model scores
    scores = {k: v for k, v in sorted(scores.items(), key = lambda item: item[1], reverse = True)}

    # Sort oof_df and test_preds
    x = x[list(scores.keys())]
    x_test = x_test[list(scores.keys())]

    STOP = False
    current_best_ensemble = x.iloc[:, 0]
    current_best_test_preds = x_test.iloc[:, 0]
    MODELS = x.iloc[:, 1:]
    weight_range = np.arange(-0.5, 0.51, 0.01)
    history = [roc_auc_score(y, current_best_ensemble)]
    j = 0

    while not STOP:
        j += 1
        potential_new_best_cv_score = roc_auc_score(y, current_best_ensemble)
        k_best, wgt_best = None, None
        # Try blending each remaining model into the current ensemble with every candidate weight
        for k in MODELS:
            for wgt in weight_range:
                potential_ensemble = (1 - wgt) * current_best_ensemble + wgt * MODELS[k]
                cv_score = roc_auc_score(y, potential_ensemble)
                if cv_score > potential_new_best_cv_score:
                    potential_new_best_cv_score = cv_score
                    k_best, wgt_best = k, wgt
        if k_best is not None:
            # Keep the best (model, weight) pair found in this pass and remove the model from the pool
            current_best_ensemble = (1 - wgt_best) * current_best_ensemble + wgt_best * MODELS[k_best]
            current_best_test_preds = (1 - wgt_best) * current_best_test_preds + wgt_best * x_test[k_best]
            MODELS.drop(k_best, axis = 1, inplace = True)
            if MODELS.shape[1] == 0:
                STOP = True
            history.append(potential_new_best_cv_score)
        else:
            STOP = True

    hill_ens_pred_1 = current_best_ensemble
    hill_ens_pred_2 = current_best_test_preds

    return [hill_ens_pred_1, hill_ens_pred_2]
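In short, this helper ranks the models by their out-of-fold ROC-AUC, seeds the ensemble with the single best model, and then greedily tries to blend in one of the remaining models at a time, scanning blend weights from -0.5 to 0.5 in steps of 0.01 (negative weights allow it to partially subtract a model). A model is only added if the blend improves the out-of-fold score, and the search stops when no remaining model improves it or when all models have been used.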
Next, we train several baseline models with 10-fold cross-validation.
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, HistGradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

ens_cv_scores, ens_preds = list(), list()
hill_ens_cv_scores, hill_ens_preds = list(), list()

sk = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 1, random_state = 42)

for i, (train_idx, test_idx) in enumerate(sk.split(X, Y)):

    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    Y_train, Y_test = Y.iloc[train_idx], Y.iloc[test_idx]

    print('----------------------------------------------------------')

    ########
    ## RF ##
    ########

    RF_md = RandomForestClassifier(n_estimators = 500,
                                   max_depth = 7,
                                   min_samples_split = 15,
                                   min_samples_leaf = 10).fit(X_train, Y_train)

    RF_pred = RF_md.predict_proba(X_test)[:, 1]
    RF_score = roc_auc_score(Y_test, RF_pred)
    print('Fold', i, '==> RF oof ROC-AUC score is ==>', RF_score)

    RF_pred_test = RF_md.predict_proba(test_cv)[:, 1]

    #################
    ## Extra Trees ##
    #################

    ET_md = ExtraTreesClassifier(n_estimators = 500,
                                 max_depth = 7,
                                 min_samples_split = 15,
                                 min_samples_leaf = 10).fit(X_train, Y_train)

    ET_pred = ET_md.predict_proba(X_test)[:, 1]
    ET_score = roc_auc_score(Y_test, ET_pred)
    print('Fold', i, '==> ET oof ROC-AUC score is ==>', ET_score)

    ET_pred_test = ET_md.predict_proba(test_cv)[:, 1]

    ##########################
    ## HistGradientBoosting ##
    ##########################

    hist_md = HistGradientBoostingClassifier(l2_regularization = 0.01,
                                             early_stopping = False,
                                             learning_rate = 0.01,
                                             max_iter = 500,
                                             max_depth = 5,
                                             max_bins = 255,
                                             min_samples_leaf = 15,
                                             max_leaf_nodes = 10).fit(X_train, Y_train)

    hist_pred = hist_md.predict_proba(X_test)[:, 1]
    hist_score = roc_auc_score(Y_test, hist_pred)
    print('Fold', i, '==> Hist oof ROC-AUC score is ==>', hist_score)

    hist_pred_test = hist_md.predict_proba(test_cv)[:, 1]

    ##########
    ## LGBM ##
    ##########

    LGBM_md = LGBMClassifier(objective = 'binary',
                             n_estimators = 500,
                             max_depth = 7,
                             learning_rate = 0.01,
                             num_leaves = 20,
                             reg_alpha = 3,
                             reg_lambda = 3,
                             subsample = 0.7,
                             colsample_bytree = 0.7).fit(X_train, Y_train)

    lgb_pred = LGBM_md.predict_proba(X_test)[:, 1]
    lgb_score = roc_auc_score(Y_test, lgb_pred)
    print('Fold', i, '==> LGBM oof ROC-AUC score is ==>', lgb_score)

    lgb_pred_test = LGBM_md.predict_proba(test_cv)[:, 1]

    #########
    ## XGB ##
    #########

    XGB_md = XGBClassifier(objective = 'binary:logistic',
                           tree_method = 'hist',
                           colsample_bytree = 0.7,
                           gamma = 2,
                           learning_rate = 0.01,
                           max_depth = 7,
                           min_child_weight = 10,
                           n_estimators = 500,
                           subsample = 0.7).fit(X_train, Y_train)

    xgb_pred = XGB_md.predict_proba(X_test)[:, 1]
    xgb_score = roc_auc_score(Y_test, xgb_pred)
    print('Fold', i, '==> XGB oof ROC-AUC score is ==>', xgb_score)

    xgb_pred_test = XGB_md.predict_proba(test_cv)[:, 1]

    ##############
    ## CatBoost ##
    ##############

    Cat_md = CatBoostClassifier(loss_function = 'Logloss',
                                iterations = 500,
                                learning_rate = 0.01,
                                depth = 7,
                                random_strength = 0.5,
                                bagging_temperature = 0.7,
                                border_count = 30,
                                l2_leaf_reg = 5,
                                verbose = False,
                                task_type = 'CPU').fit(X_train, Y_train)

    cat_pred = Cat_md.predict_proba(X_test)[:, 1]
    cat_score = roc_auc_score(Y_test, cat_pred)
    print('Fold', i, '==> CatBoost oof ROC-AUC score is ==>', cat_score)

    cat_pred_test = Cat_md.predict_proba(test_cv)[:, 1]

    ##############
    ## Ensemble ##
    ##############

    # Simple average of the six models' fold and test predictions
    ens_pred_1 = (RF_pred + ET_pred + hist_pred + lgb_pred + xgb_pred + cat_pred) / 6
    ens_pred_2 = (RF_pred_test + ET_pred_test + hist_pred_test + lgb_pred_test + xgb_pred_test + cat_pred_test) / 6

    ens_score_fold = roc_auc_score(Y_test, ens_pred_1)
    ens_cv_scores.append(ens_score_fold)
    ens_preds.append(ens_pred_2)

    print('Fold', i, '==> Average Ensemble oof ROC-AUC score is ==>', ens_score_fold)

    ############################
    ## Hill Climbing Ensemble ##
    ############################

    x = pd.DataFrame({'RF': RF_pred,
                      'ET': ET_pred,
                      'Hist': hist_pred,
                      'LGBM': lgb_pred,
                      'XGB': xgb_pred,
                      'Cat': cat_pred})
    y = Y_test

    x_test = pd.DataFrame({'RF': RF_pred_test,
                           'ET': ET_pred_test,
                           'Hist': hist_pred_test,
                           'LGBM': lgb_pred_test,
                           'XGB': xgb_pred_test,
                           'Cat': cat_pred_test})

    hill_results = hill_climbing(x, y, x_test)

    hill_ens_score_fold = roc_auc_score(y, hill_results[0])
    hill_ens_cv_scores.append(hill_ens_score_fold)
    hill_ens_preds.append(hill_results[1])

    print('Fold', i, '==> Hill Climbing Ensemble oof ROC-AUC score is ==>', hill_ens_score_fold)
print('The mean 10-fold ROC-AUC of the average ensemble is', np.mean(ens_cv_scores))
print('The mean 10-fold ROC-AUC of the hill climbing ensemble is', np.mean(hill_ens_cv_scores))

The mean 10-fold ROC-AUC of the average ensemble is 0.7917392082505299
The mean 10-fold ROC-AUC of the hill climbing ensemble is 0.7932600401008065
To wrap up, here is a more production-ready hill-climbing blending script that you can adapt as needed. If you find it useful, feel free to give it a like~
import os

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import minmax_scale


def init_hillclimb():
    """Initializes an empty ensemble"""
    best_ensemble = {}
    for label in LABELS:
        best_ensemble[label] = []
    best_score = {}
    for label in LABELS:
        best_score[label] = 0
    return best_ensemble, best_score


def score_ensemble(ensemble, label):
    """Scores average AUC for an ensemble per label"""
    blend_preds = np.zeros(len(train))
    for model in ensemble:
        blend_preds += minmax_scale(files[model][label])
    blend_preds = blend_preds / len(subnums)
    score = roc_auc_score(train[label], blend_preds)
    return score


def find_best_improvement(ensemble, label):
    """Finds the optimal model to add next to an ensemble"""
    best_score = 0
    best_ensemble = []
    for i in range(0, len(files)):
        ensemble = ensemble + [i]
        score = score_ensemble(ensemble, label)
        if score > best_score:
            best_score = score
            best_ensemble = ensemble
        ensemble = ensemble[:-1]
    return best_ensemble, best_score


def climb(best_ensemble, best_score):
    """Performs a step for each label"""
    for label in LABELS:
        best_ensemble[label], best_score[label] = find_best_improvement(best_ensemble[label], label)
    return best_ensemble, best_score


def get_optimal_weights(best_ensemble):
    """Gets optimal blending weights after hillclimb"""
    weights = {}
    for label in LABELS:
        weights[label] = {}
        for num in set(best_ensemble[label]):
            weights[label][num] = best_ensemble[label].count(num) / len(best_ensemble[label])
    return weights


def get_optimal_blend(optimal_weights):
    """Constructs a pandas dataframe using the optimal blending weights"""
    sub = pd.read_csv(os.path.join(DATA_PATH, "sample_submission.csv"))
    blend = sub.copy()
    for label in LABELS:
        print(label)
        for key in optimal_weights[label]:
            blend[label] += optimal_weights[label][key] * minmax_scale(get_sub_file(key)[label])
            print(optimal_weights[label][key], filenames[key])
        blend[label] = minmax_scale(blend[label])
    return blend


def get_sub_file(num):
    filename = filenames[num]
    filename = filename.replace("oof", "sub")
    return pd.read_csv(os.path.join(SUB_PATH, filename))


if __name__ == "__main__":

    DATA_PATH = "../input/jigsaw-toxic-comment-classification-challenge/"
    SUB_PATH = "../input/oofmodels/trained-models/"

    train = pd.read_csv(os.path.join(DATA_PATH, "train.csv")).fillna(' ')
    test = pd.read_csv(os.path.join(DATA_PATH, "test.csv")).fillna(' ')
    LABELS = train.columns[2:]

    # Get submissions and out-of-fold predictions
    subnums = [29, 51, 52]
    filenames = ["oof" + str(num) + ".csv" for num in subnums]
    files = [pd.read_csv(os.path.join(SUB_PATH, filename)) for filename in filenames]

    best_ensemble, best_score = init_hillclimb()

    # Run hillclimb
    for i in range(25):
        print("-------------")
        print("Step", i)
        best_ensemble, best_score = climb(best_ensemble, best_score)
        print("Best ensemble:")
        print(best_ensemble)
        print("Best score:")
        print(best_score)
        print("AUC:", np.mean([best_score[label] for label in LABELS]))

    # Get optimal weights
    opt_w = get_optimal_weights(best_ensemble)
    print("Optimal weights:")
    print(opt_w)

    # Construct the blend
    blend = get_optimal_blend(opt_w)

    # Submit
    blend.to_csv("hillclimb.csv", index=False)