Understanding Ensemble Learning in One Article: Principles + Python Code

Author: 机器学习司猫白

Published 2025-01-21 17:42:29

Column: 机器学习实战 (Machine Learning in Practice)


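The idea behind ensemble learning

• Combine multiple models.

• When a single model cannot solve the problem, let several models work on it together.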

Data preparation: make_moons

A dataset shaped like two interleaving crescents, commonly used for classification problems in machine learning.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
# Generate 500 sample points; a fairly high noise level is used so that differences between models are visible
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
plt.plot(X[:, 0][y == 0], X[:, 1][y == 0], 'ro', alpha=0.6)
plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], 'bs', alpha=0.6)
plt.show()

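Voting strategy

• Hard voting: use the predicted class labels directly; the majority wins.

• Soft voting: take a weighted average of the class probabilities from each classifier (this requires every classifier to be able to output probabilities).

Hard voting

Suppose that, for some sample, three models predict "red, red, blue". Red : blue = 2 : 1, so the hard-voting prediction is "red".

The following example trains a random forest, a logistic regression and a support vector machine, and reports the accuracy of each classifier together with that of the voting ensemble.

Code example: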

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
log_clf = LogisticRegression()
rfc_clf = RandomForestClassifier()
svm_clf = SVC()
# Hard voting
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rfc_clf),
                                          ('svc', svm_clf)], voting='hard')

# Fit each classifier and the ensemble, then report test accuracy
for clf in (log_clf, rfc_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

Output:

LogisticRegression 0.864

RandomForestClassifier 0.896

SVC 0.896

VotingClassifier 0.904

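As shown, hard voting reaches an accuracy of 0.904, higher than any of the single models.

Soft voting

For a given sample, each of the three models outputs a probability for every class; the probabilities are averaged and the class with the highest average is predicted.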
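To make the difference between the two strategies concrete, here is a minimal sketch with made-up probabilities for one sample. The three classifiers and their numbers are hypothetical and are not taken from the models trained in this article; equal weights are assumed.

```python
import numpy as np

# Made-up probabilities for ONE sample from three hypothetical classifiers.
# Columns: P(red), P(blue). The numbers are illustrative only.
classes = np.array(["red", "blue"])
probas = np.array([
    [0.55, 0.45],   # classifier A leans red
    [0.60, 0.40],   # classifier B leans red
    [0.10, 0.90],   # classifier C is confident about blue
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = classes[np.bincount(votes, minlength=len(classes)).argmax()]

# Soft voting: average the probabilities (equal weights assumed), then take the argmax.
weights = np.array([1.0, 1.0, 1.0])
avg_proba = np.average(probas, axis=0, weights=weights)
soft_winner = classes[avg_proba.argmax()]

print("hard voting:", hard_winner)
print("soft voting:", soft_winner, avg_proba.round(3))
```

With these numbers the two strategies disagree: two of three votes go to red, but the one confident classifier pulls the averaged probability toward blue.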

Code example:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score
log_clf = LogisticRegression()
rfc_clf = RandomForestClassifier()
# SVC does not compute probabilities by default; pass probability=True explicitly
svm_clf = SVC(probability=True)
# Switch the voting parameter to 'soft'
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rfc_clf),
                                          ('svc', svm_clf)], voting='soft')
# Fit each classifier and the ensemble, then report test accuracy
for clf in (log_clf, rfc_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

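Output:

LogisticRegression 0.864

RandomForestClassifier 0.88

SVC 0.896

VotingClassifier 0.92

In this run soft voting (0.92) does better than hard voting (0.904).

For a single model, the classes_ attribute and the predict_proba() method give the probability of each class.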
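As a small illustration of that last point, the sketch below fits one logistic regression on the same make_moons data and reads the class order and the per-class probabilities. The choice of model and of the first three test samples is only for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression().fit(X_train, y_train)

# classes_ gives the column order of predict_proba
print(clf.classes_)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_test[:3])
print(proba)

# Picking the most probable column reproduces predict()
pred_from_proba = clf.classes_[proba.argmax(axis=1)]
print(pred_from_proba, clf.predict(X_test[:3]))
```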

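Bagging strategy

Algorithm

First, sample the training set several times (both rows and features), making sure each sampled set is different (this increases diversity and lowers the risk of overfitting).
Train several base models (in parallel, which improves efficiency).
The individual models vary in quality. To predict a new sample, first collect the predictions of all models, then combine them to obtain the final prediction.

Characteristics:

• Randomness: both the data sampling and the feature selection are random.

• Parallel computation: the models are trained independently and at the same time, which increases diversity.

• Prediction requires the results of all base models before they can be combined.

• Addresses overfitting.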
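The procedure above can be sketched by hand in a few lines. This is a simplified illustration, not how BaggingClassifier is implemented: it bootstraps rows only (the data has just two features, so feature sampling is skipped), trains the trees one after another instead of in parallel, and uses an assumed ensemble size of 50.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_models = 50  # assumed ensemble size for this sketch
trees = []
for _ in range(n_models):
    # Bootstrap: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Collect every tree's prediction, then combine by majority vote.
# Labels are 0/1, so "mean > 0.5" means more than half of the trees voted 1.
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (n_models, n_test)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
bagging_acc = accuracy_score(y_test, bagged_pred)
print("manual bagging accuracy:", bagging_acc)
```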

Code example:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Base model, number of models, maximum number of features
bag_clf = BaggingClassifier(estimator=DecisionTreeClassifier(),  # base model
                            n_estimators=500,         # number of models
                            max_features=X.shape[1],  # features drawn per model (here: all of them)
                            bootstrap=True,           # sample the training rows with replacement
                            n_jobs=-1,                # parallel jobs; -1 uses all CPUs
                            random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

# A single decision tree for comparison
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))

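The example above builds a Bagging classifier and a single decision-tree classifier. Their accuracies are 0.92 and 0.856 respectively, so Bagging does better than the single tree.

Because many decision trees have to be trained, Bagging costs more computation than a single tree.

Visualizing the decision boundary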

import numpy as np
from matplotlib.colors import ListedColormap
def plot_decision_boundary(clf, X, y, axes=[-2, 3, -2, 2], alpha=0.5,
                           contour=True):
    # Evaluate the classifier on a 100 x 100 grid covering the plot area
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#FFC0CB', '#F0FFFF', '#507d50'])
    # alpha sets the transparency of the filled regions
    plt.contourf(x1, x2, y_pred, cmap=custom_cmap, alpha=alpha)
    if contour:
        custom_cmaps = ListedColormap(['#7d7d58', '#F0FFFF', '#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmaps, alpha=0.8)
    # Class 0 as red circles, class 1 as blue squares
    plt.plot(X[:, 0][y == 0], X[:, 1][y == 0], 'ro', alpha=0.6)
    plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], 'bs', alpha=0.6)
    plt.axis(axes)
    plt.xlabel('x1')
    plt.ylabel('x2')
plt.figure(figsize=(12, 5))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title('Decision Tree')
plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title('Decision Tree with Bagging')
plt.show()

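Overall, Bagging does better than the single model.

Near (1.5, 0.25) there are two red points that, judging by the overall distribution, would more plausibly be blue. The decision tree classifies these two points correctly and Bagging does not, which suggests the tree has fitted noise; in terms of generalization, Bagging is better than the single decision tree.

OOB strategy

• Out of bag (OOB): addresses the validation problem.

In Bagging, each model is usually built on a sample drawn from the training set with replacement. Because of the replacement, roughly 37% of the training samples are never drawn for a given model; these are called the OOB samples.

The oob_score parameter uses these undrawn samples to evaluate the model. During training, each base model predicts only the samples it did not see, and these predictions are aggregated into an out-of-bag score (OOB score). This score estimates how the whole ensemble performs on unseen data.

When sampling with replacement from a training set of N samples, the probability that a given sample is picked in one draw is 1/N. The probability that it is never picked in N draws is therefore (1 - 1/N)^N, which approaches 1/e, about 0.37, as N grows large.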
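The limit can be checked numerically. The sketch below evaluates (1 - 1/N)^N for a few values of N and then simulates one bootstrap draw to measure the share of samples that were never picked; the sample size of 10000 and the seed are arbitrary choices for the illustration.

```python
import numpy as np

# Closed-form probability of never being drawn, for growing N
for n in (10, 100, 1000, 10000):
    print(n, (1 - 1 / n) ** n)
print("1/e =", np.exp(-1))

# Empirical check: draw N indices with replacement and count the ones never drawn
rng = np.random.default_rng(0)
n = 10000
sample_idx = rng.integers(0, n, size=n)
oob_fraction = 1 - len(np.unique(sample_idx)) / n
print("empirical OOB fraction:", oob_fraction)
```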

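Code example:

# oob_score=True enables out-of-bag evaluation
bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base model
    n_estimators=500,                    # number of models
    max_features=X.shape[1],             # features drawn per model (here: all of them)
    bootstrap=True,                      # sample the training rows with replacement
    n_jobs=-1,                           # parallel jobs; -1 uses all CPUs
    random_state=42,
    oob_score=True)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)  # accuracy estimated on the out-of-bag samples

Output: 0.896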

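Random forest

• The classic representative of Bagging.

• Handles very high-dimensional data without requiring feature selection.

• After training, it can report which features matter most, which makes it easy to interpret.

• Can be trained in parallel, so it is relatively fast.

Code example: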

from sklearn.datasets import load_iris
iris = load_iris()
rf_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rf_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rf_clf.feature_importances_):
    print(name, score)

The code above prints:

sepal length (cm) 0.10090833612940953

sepal width (cm) 0.022648813776966623

petal length (cm) 0.43745799107944655

petal width (cm) 0.4389848590141774

The feature_importances_ attribute holds the importance of each feature.
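A short sketch of how these values can be read: in scikit-learn the importances are normalized to sum to 1, so they can be ranked and compared directly. The forest size of 200 and the fixed random_state are assumptions made here for reproducibility, so the exact numbers will differ from the run shown in this article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(iris.data, iris.target)

importances = rf.feature_importances_

# Rank the features from most to least important
order = np.argsort(importances)[::-1]
for i in order:
    print(f"{iris.feature_names[i]:<20} {importances[i]:.3f}")

# The importances are normalized, so they add up to 1
print("sum:", importances.sum())
```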

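Feature importance

Data preparation: the MNIST handwritten-digit dataset, where each row holds the pixel values of a 28*28 grayscale image.

Each row has 784 columns, and mnist is a dictionary-like object.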

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, parser='auto')
rf_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rf_clf.fit(mnist['data'], mnist['target'])

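To visualize a sample, take a row (here the first one), reshape it into a 28*28 two-dimensional array and draw it as an image.

The first row is the digit 5.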

plt.imshow(mnist['data'].iloc[0, :].values.reshape((28, 28)), cmap='gray')
plt.show()

Once the model is trained, its feature importances can be inspected (how much each pixel contributes to the classification).

import matplotlib.cm
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap=matplotlib.cm.hot)
    plt.axis('off')
    plt.colorbar(ticks=[rf_clf.feature_importances_.min(),
    rf_clf.feature_importances_.max()])
plot_digit(rf_clf.feature_importances_)

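In the four corners of the image the pixel values are essentially 0 (digits are conventionally centered when images are prepared), while the strokes of the digits usually pass through the middle. As a result, the feature importance in the middle is far higher than in the corners.

Boosting strategy

AdaBoost

AdaBoost stands for adaptive boosting.

How AdaBoost works

Think of practicing exercises: the questions you got wrong this time are the ones you concentrate on getting right next time.
AdaBoost keeps raising the importance of misclassified samples. The darker a sample's color, the higher its importance.

The following example uses SVM classifiers to walk through a simplified version of the AdaBoost procedure.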

from sklearn.svm import SVC
m = len(X_train)
plt.figure(figsize=(14, 5))
for subplot, learning_rate in ((121, 1), (122, 0.5)):

    # All samples start with the same weight
    sample_weights = np.ones(m)

    plt.subplot(subplot)
    for i in range(5):
        svm_clf = SVC(kernel='rbf', C=0.05, random_state=42)

        # Train the model with the current sample weights
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
        y_pred = svm_clf.predict(X_train)

        # Increase the weights of misclassified samples;
        # learning_rate controls the size of the increase
        sample_weights[y_pred != y_train] *= (1 + learning_rate)

        # Decision boundary of this round
        plot_decision_boundary(svm_clf, X, y, alpha=0.2)
    plt.title('learning_rate = {}'.format(learning_rate))
    if subplot == 121:
        # Number the five successive boundaries in the left panel
        plt.text(-0.7, -0.65, '1', fontsize=25)
        plt.text(-0.6, -0.10, '2', fontsize=25)
        plt.text(-0.5, 0.15, '3', fontsize=25)
        plt.text(-0.4, 0.55, '4', fontsize=25)
        plt.text(-0.3, 0.90, '5', fontsize=25)
plt.show()

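Key points:

• Weight updates.

• Tuning the learning rate: too large a learning rate can make the model worse, while a smaller one requires more iterations.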
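For reference, here is a hedged sketch of one common form of the weight update for two classes, where each learner also receives a weight alpha computed from its weighted error. It is an illustration under stated assumptions (decision stumps as weak learners, labels recoded to -1/+1, 50 rounds, learning rate 0.5), not the exact code behind AdaBoostClassifier.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=42)
y_signed = 2 * y_demo - 1  # relabel {0, 1} as {-1, +1}

n_samples = len(X_demo)
weights = np.full(n_samples, 1 / n_samples)  # all samples start equal
learning_rate = 0.5
stumps, alphas = [], []

for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_demo, y_signed, sample_weight=weights)
    miss = stump.predict(X_demo) != y_signed

    # Weighted error of this learner (the weights sum to 1)
    err = weights[miss].sum()
    if err <= 0 or err >= 0.5:
        break  # perfect or no better than chance: stop
    alpha = learning_rate * np.log((1 - err) / err)

    # Raise the weights of the misclassified samples, then renormalize
    weights[miss] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of the learners' votes
score = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ensemble_acc = np.mean(np.sign(score) == y_signed)
first_acc = np.mean(stumps[0].predict(X_demo) == y_signed)
print("first stump:", first_acc, "ensemble:", ensemble_acc)
```

A smaller learning_rate shrinks every alpha, so each round changes the weights less and more rounds are needed to reach the same fit.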

AdaBoost example

from sklearn.ensemble import AdaBoostClassifier
# AdaBoost classifier
ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # base model: a decision stump
    n_estimators=200,
    learning_rate=0.5,  # learning rate
    random_state=42
)
ada_clf.fit(X_train, y_train)
plot_decision_boundary(ada_clf, X, y)

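Gradient Boosting

How Gradient Boosting works

Training starts with a first base model, then a second trained base model is added. A candidate may help or hurt; the ones that hurt are dropped, and the ensemble becomes stronger step by step. Each round aims to reduce the residual left by the previous round: GBDT builds a new model in the direction that reduces the residual (the negative gradient).

Start from weak learners and train them with weighting, reducing the error step by step.
Use the residual of the previous round as the target of the current round, and minimize the residual of the current round.
Computation is sequential: the model can only improve one step at a time, so the algorithm is slow.
Addresses underfitting.

The representative of Gradient Boosting is the gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT).

GBDT combines the idea of boosting with decision-tree models: it trains decision trees iteratively to reduce the bias of the model and thereby improve prediction accuracy.

The basic principles of GBDT are as follows:

Boosting: Boosting is an ensemble learning technique whose core idea is to combine several weak learners into one strong learner. In GBDT the weak learners are decision trees, in practice regression trees, because each tree fits residuals (negative gradients), which are continuous values even in classification tasks. Each round of learning reduces the residual of the previous round (the difference between the true and predicted values), and in this way the model gradually strengthens its predictive power.
Gradient boosting (negative gradient, first-order expansion): Gradient boosting is a form of boosting that uses gradient descent to minimize a loss function. In each iteration GBDT builds a new decision tree that predicts the residual of all the previous trees combined, then adds the new tree's prediction to theirs, gradually lowering the overall loss.
Iterative process:
Initialization: GBDT first makes an initial prediction with a base model.
Iterative construction: the algorithm then enters a loop, and each iteration consists of the following steps:
a. Compute the residual: the difference between the true values and the current model's predictions.
b. Build a decision tree: fit a new decision tree with the residual as the target.
c. Update the model: multiply the new tree's prediction by a learning rate (also called the step size) and add it to the current model's prediction.
Model output: once the specified number of iterations is reached, the predictions of all trees are added up to give the final prediction.
Learning rate: The learning rate is an important hyperparameter that controls how much each tree contributes to the final model. A smaller learning rate means more trees are needed to train the model; it often improves performance but also raises the computational cost.
Regularization: To avoid overfitting, GBDT can be regularized in several ways, such as limiting the maximum depth of a tree, the minimum number of samples in a leaf, or the maximum number of leaf nodes.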
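The iterative process above (initialize, compute residuals, fit a tree, update with a learning rate) can be sketched as a short loop. This is a minimal illustration for squared-error loss, where the negative gradient is exactly the residual; the data, the names X_demo and y_demo, the 50 rounds and the learning rate of 0.1 are all assumptions of the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: a noisy quadratic on [-0.5, 0.5]
rng = np.random.default_rng(42)
X_demo = rng.random((100, 1)) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

learning_rate = 0.1
n_rounds = 50

# Initialization: a constant prediction (the mean of the targets)
init_value = y_demo.mean()
current_pred = np.full_like(y_demo, init_value)
initial_mse = np.mean((y_demo - current_pred) ** 2)

trees = []
for _ in range(n_rounds):
    # a. residual = negative gradient of 1/2 * (y - F(x))^2 with respect to F(x)
    residual = y_demo - current_pred
    # b. fit a small tree to the residual
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residual)
    # c. add the shrunken prediction of the new tree to the model
    current_pred += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - current_pred) ** 2)
print("MSE before boosting:", initial_mse, "after:", final_mse)


def predict(X_new):
    """Final model: initial constant plus the shrunken sum of all trees."""
    return init_value + learning_rate * sum(t.predict(X_new) for t in trees)
```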

Data preparation: generate some random sample points.

np.random.seed(42)
X = np.random.rand(100, 1)-0.5
y = 3*X[:, 0]**2 + 0.05*np.random.randn(100)

Step 1: train the first model.

from sklearn.tree import DecisionTreeRegressor
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

Step 2: compute the errors of the first round and fit a second tree to them.

# Residuals of the first tree
y2 = y-tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

Step 3: compute the errors of the second round and fit a third tree.

y3 = y2-tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

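The prediction computed below is 0.75026781, while the true value is 1.92 (3 * 0.8^2), so the error is still large. Only two rounds of residual learning were performed, which is too few; note also that x = 0.8 lies outside the training range [-0.5, 0.5], where tree models cannot extrapolate.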

X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

Plot the result of each round.

Gradient Boosting example

from sklearn.ensemble import GradientBoostingRegressor
# Large learning rate, few trees
gbdt = GradientBoostingRegressor(max_depth=2,
                                 n_estimators=3,
                                 learning_rate=1.0,
                                 random_state=41)
gbdt.fit(X, y)

# Small learning rate, few trees
gbdt_slow_1 = GradientBoostingRegressor(max_depth=2,
                                        n_estimators=3,
                                        learning_rate=0.1,
                                        random_state=41)
gbdt_slow_1.fit(X, y)

# Small learning rate, many trees
gbdt_slow_2 = GradientBoostingRegressor(max_depth=2,
                                        n_estimators=200,
                                        learning_rate=0.1,
                                        random_state=41)
gbdt_slow_2.fit(X, y)

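With a small learning rate, the ensemble needs to be large enough.

# plot_predictions is not defined in this article; the helper below is an
# assumed minimal version that draws the data and the summed predictions.
def plot_predictions(regressors, X, y, axes, label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(reg.predict(x1.reshape(-1, 1)) for reg in regressors)
    plt.plot(X[:, 0], y, 'b.')
    plt.plot(x1, y_pred, 'r-', linewidth=2, label=label)
    if label:
        plt.legend(loc='upper center')
    plt.axis(axes)

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_predictions([gbdt], X, y, axes=[-0.5, 0.5, -0.1, 0.8],
                 label='Ensemble predictions')
plt.title('learning rate = {}, n_estimators = {}'.format(
    gbdt.learning_rate, gbdt.n_estimators))
plt.subplot(122)
plot_predictions([gbdt_slow_1], X, y, axes=[-0.5, 0.5, -0.1, 0.8],
                 label='Ensemble predictions')
plt.title('learning rate = {}, n_estimators = {}'.format(
    gbdt_slow_1.learning_rate, gbdt_slow_1.n_estimators))
plt.show()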

A large ensemble overfits easily.

# plot_predictions is the same plotting helper used in the previous block
plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_predictions([gbdt_slow_2], X, y, axes=[-0.5, 0.5, -0.1, 0.8],
                 label='Ensemble predictions')
plt.title('learning rate = {}, n_estimators = {}'.format(
    gbdt_slow_2.learning_rate, gbdt_slow_2.n_estimators))
plt.subplot(122)
plot_predictions([gbdt_slow_1], X, y, axes=[-0.5, 0.5, -0.1, 0.8],
                 label='Ensemble predictions')
plt.title('learning rate = {}, n_estimators = {}'.format(
    gbdt_slow_1.learning_rate, gbdt_slow_1.n_estimators))
plt.show()

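XGBoost

XGBoost (eXtreme Gradient Boosting) is an efficient and widely used implementation of GBDT (Gradient Boosting Decision Tree) that aims to improve on the speed and performance of traditional GBDT.

XGBoost optimizes the traditional GBDT algorithm in both computation (it uses a second-order Taylor expansion of the loss) and memory usage. It uses a technique known as approximate split finding, which markedly improves efficiency on large datasets.
Unlike traditional GBDT, XGBoost adds regularization terms (L1 and L2) to the loss function. This helps control model complexity, which lowers the risk of overfitting and improves generalization.
Parallel processing: although the trees themselves are built sequentially, XGBoost can parallelize across features and speeds up training through optimized data structures and algorithms.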
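To give a feel for what the second-order expansion buys, here is a small NumPy sketch for a single leaf with logistic loss. With gradients g and second derivatives h summed over the leaf (G and H) and an L2 weight lambda, the approximated objective for a leaf value w is G*w + 1/2*(H + lambda)*w^2, which is minimized at w = -G / (H + lambda). The labels, the starting score of 0 and lambda = 1 are made-up values, and the L1 and gamma terms are left out for simplicity.

```python
import numpy as np

# Made-up leaf: five samples with binary labels, current raw score (log-odds) of 0
y_leaf = np.array([1, 1, 0, 1, 0])
raw_score = np.zeros(5)
reg_lambda = 1.0  # assumed L2 regularization weight

# First and second derivatives of the logistic loss with respect to the raw score
p = 1 / (1 + np.exp(-raw_score))
g = p - y_leaf
h = p * (1 - p)
G, H = g.sum(), h.sum()

# Closed-form leaf value from the second-order approximation
w_star = -G / (H + reg_lambda)


def approx_objective(w):
    """Second-order approximation of the regularized loss for leaf value w."""
    return G * w + 0.5 * (H + reg_lambda) * w ** 2


# Brute-force check: the grid minimum should sit next to the closed-form value
grid = np.linspace(-2, 2, 4001)
w_grid = grid[np.argmin(approx_objective(grid))]
print("closed form:", w_star, "grid search:", w_grid)
```

Because the optimum has a closed form, candidate splits can be scored from G and H alone, without re-fitting a model for every candidate.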

A simple example

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
# Convert to DMatrix, XGBoost's optimized data structure
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set the parameters
# This is a binary classification problem, so 'binary:logistic' is the objective
param = {
    'max_depth': 3,  # maximum tree depth
    'eta': 0.3,  # learning rate
    'objective': 'binary:logistic',
    'eval_metric': 'logloss'  # evaluation metric
}
num_round = 100  # number of boosting rounds
# Train the model
bst = xgb.train(param, dtrain, num_round)
# Predict with the trained model; binary:logistic returns probabilities
preds = bst.predict(dtest)
# Round the probabilities to 0 or 1 to get class labels
predictions = [round(value) for value in preds]
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

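Main parameters

binary:logistic: binary classification; returns the predicted probability.

multi:softmax: multi-class classification; returns the predicted class.

multi:softprob: multi-class classification; returns the probability of each class for every sample.

General parameters

booster: the type of model fitted at each iteration; the common choices are gbtree (tree models) and gblinear (linear models).

nthread: the number of parallel threads. Defaults to the maximum number of threads available. In some interfaces, such as the scikit-learn wrapper, this parameter is called n_jobs.

Booster parameters

eta (also called learning_rate): the step-size shrinkage applied at each iteration, used to prevent overfitting. Shrinking the weight of each step makes the model more robust. Range: [0, 1].

min_child_weight: the minimum sum of sample weights required in a leaf node. Used to control overfitting; larger values make the model more conservative.

max_depth: the maximum depth of a tree. Used to control overfitting; larger values let the model learn more sample-specific patterns.

gamma (also called min_split_loss): a node is split only if the split reduces the loss by more than this value; in regression tasks it acts like a regularization parameter.

subsample: the fraction of the training data used for each tree; helps prevent overfitting.

colsample_bytree: the fraction of columns (features) sampled when building each tree.

lambda (L2 regularization weight): increasing this value makes the model more conservative.

alpha (L1 regularization weight): can be used with very high-dimensional data to make the algorithm run faster.

Task parameters

objective: specifies the learning task and the corresponding objective, for example binary:logistic for binary classification and reg:squarederror for regression.

eval_metric: the metric used to evaluate the model, for example auc (area under the ROC curve), or rmse (root mean squared error) for regression.

seed: the random seed, used to make results reproducible.

Early stopping

After many boosting rounds the model may start to get worse; when that happens, training should be stopped early.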

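from sklearn.metrics import mean_absolute_error
# The XGBoost example reassigned X and y, so regenerate the quadratic
# regression data from the Gradient Boosting section (same seed, same values)
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
gbdt = GradientBoostingRegressor(max_depth=2,
                                 n_estimators=120,
                                 learning_rate=1.0,
                                 random_state=42)
gbdt.fit(X_train, y_train)
# staged_predict yields the prediction after each successive tree is added
errors = [mean_absolute_error(y_test, y_pred)
          for y_pred in gbdt.staged_predict(X_test)]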

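The iteration with the smallest error. In the author's run, the index returned by np.argmin and the error were (28, 0.04242765129927983); the index is 0-based, so it corresponds to 29 trees.

# staged_predict starts at one tree, so add 1 to turn the index into a tree count
best_n_estimators = np.argmin(errors) + 1
min_error = np.min(errors)
best_n_estimators, min_error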

Use the number of iterations with the smallest error:

gbdt_best = GradientBoostingRegressor(max_depth=2,
                                      n_estimators=best_n_estimators,
                                      learning_rate=1.0,  # same learning rate as the model used to pick best_n_estimators
                                      random_state=42)
gbdt_best.fit(X_train, y_train)
plt.figure(figsize=(11, 4))
plt.subplot(121)
plt.plot(range(1, 121), errors, 'b-')
plt.plot([best_n_estimators, best_n_estimators], [0, min_error], 'k--')
plt.plot([0, 120], [min_error, min_error], 'k--')
plt.title('test error')
plt.subplot(122)
plot_predictions([gbdt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title('Best Model(%d trees)' % best_n_estimators)
plt.show()
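As an alternative to scanning staged_predict by hand, scikit-learn's gradient boosting estimators can stop on their own through the n_iter_no_change and validation_fraction parameters. The sketch below is one possible setup, not the method used in this article: the data (X_demo, y_demo), the patience of 10 rounds and the 20% validation split are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumed toy data: a noisy quadratic on [-0.5, 0.5]
rng = np.random.default_rng(42)
X_demo = rng.random((100, 1)) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

gbdt_es = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gbdt_es.fit(X_demo, y_demo)

# n_estimators_ is the number of trees actually fitted before stopping
print("trees actually fitted:", gbdt_es.n_estimators_)
```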