Chapter 8: Ensemble Learning Notes

Ensemble learning combines the predictions of multiple classifiers into one final decision in order to obtain better classification or regression performance. A single classifier tends to suit only one particular kind of data and rarely yields the best possible model; averaging (or voting over) the predictions of several different algorithms can often give a better model than any single classifier (a toy voting sketch follows the list below). Bagging, boosting, and random forests are the three most widely used families of ensemble learning algorithms.

  • Bagging: a voting-based method. Bootstrap sampling first produces different training sets, a base classifier is fitted on each, and the classifiers are then combined into a single, usually better, model.
  • Boosting: similar to bagging, except that the classifiers are built sequentially; each new classifier depends on the results of the previous ones and learns on the misclassified cases to compensate for earlier errors.
  • Random forest: a classifier made up of many decision trees whose outputs are combined by voting. Each tree is grown on its own resampled set of feature vectors; for classification the class with the most votes is returned, and for regression the average of the trees' predictions is used.
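
As a toy illustration of the combination step (my sketch, not from the book), the predictions of three hypothetical classifiers on four observations are combined by majority vote:

# Toy sketch: combine the predictions of three hypothetical classifiers by majority vote.
pred1 <- c("yes", "no",  "yes", "no")
pred2 <- c("yes", "yes", "no",  "no")
pred3 <- c("no",  "yes", "yes", "no")
votes <- data.frame(pred1, pred2, pred3, stringsAsFactors = FALSE)
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
majority
# [1] "yes" "yes" "yes" "no"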

8.2 Classifying data with the bagging method

The adabag package supports both bagging and boosting; its bagging implementation follows Breiman's bagging algorithm (Breiman was the first to propose this kind of combined-classifier approach).

install.packages("adabag")
library(adabag)
# 发现

data(iris)
churnTrain <- iris
ind <- sample(2,nrow(churnTrain),replace = TRUE,
              prob = c(0.7,0.3))
trainset <- churnTrain[ind==1,]
testset <- churnTrain[ind==2,]
set.seed(2)
churn.bagging <- bagging(churn~., data = trainset, mfinal = 10) #迭代次数为10
churn.bagging$importance
Petal.Length  Petal.Width Sepal.Length  Sepal.Width 
    75.53879     24.46121      0.00000      0.00000 
 churn.predbagging$confusion
               Observed Class
Predicted Class setosa versicolor virginica
     setosa         18          0         0
     versicolor      0         19         0
     virginica       0          2        15
churn.baggingcv$error
[1] 0.03703704

Bagging is short for bootstrap aggregating. It is stable, accurate, powerful, and easy to implement, and it is commonly used for both classification and regression. The algorithm is defined as follows: given a data set of size n, draw m new data sets Di by bootstrap sampling, fit one model on each of the m samples, and combine them to obtain the final model. Its main drawback is that the result is hard to interpret. A hand-rolled sketch of this definition appears right after this paragraph. As an extension, the ipred package implements the same functionality; in my testing it is much faster (the adabag run above still had not finished after half an hour), presumably because it does not perform cross-validation. Its code follows the sketch.
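
A hand-rolled version of that definition (my sketch, not the book's code), assuming trainset and testset are the iris splits built above: each of the m bootstrap resamples grows one rpart tree, and the trees vote on the test set.

# Hand-rolled bagging sketch: m bootstrap resamples, one rpart tree each, majority vote.
library(rpart)
set.seed(2)
m <- 10
models <- lapply(seq_len(m), function(i) {
  idx <- sample(nrow(trainset), replace = TRUE)      # bootstrap resample D_i
  rpart(Species ~ ., data = trainset[idx, ])
})
preds <- sapply(models, function(fit)
  as.character(predict(fit, newdata = testset, type = "class")))
vote <- apply(preds, 1, function(v) names(which.max(table(v))))
mean(vote != testset$Species)                        # test-set error of the ensemble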

# the ipred implementation (note: trainset here is the churn training split, not iris)
library(ipred)
churn.bagging <- bagging(churn ~ ., data = trainset, coob = TRUE)
churn.bagging
Bagging classification trees with 25 bootstrap replications 

Call: bagging.data.frame(formula = churn ~ ., data = trainset, coob = TRUE)

Out-of-bag estimate of misclassification error:  0.0606 
# misclassification rate
mean(predict(churn.bagging) != trainset$churn)
[1] 0.06115418
# prediction
churn.prediction <- predict(churn.bagging, newdata = testset, type = "class")
prediction.table <- table(churn.prediction, testset$churn)
prediction.table
churn.prediction  yes   no
            yes  170   16
            no    57 1274

8.3 Performing cross-validation with the bagging method

This assesses the robustness of the classification model.

# cross-validation with bagging (the iris-based trainset from above)
churn.baggingcv <- bagging.cv(Species ~ ., v = 10, data = trainset,
                              mfinal = 10)
# With the original churn data this call failed with
#   Error in bagging.cv(churn ~ ., v = 10, data = trainset, mfinal = 10) : 
#   v should be in [2, n]
# which is a problem with that data set, so iris is used here instead
churn.baggingcv$confusion
               Observed Class
Predicted Class setosa versicolor virginica
     setosa         18          0         0
     versicolor      0         19         0
     virginica       0          2        15
# misclassification rate
churn.predbagging$error
[1] 0.03703704

The churn data set throws an error here, so the simpler iris data set is used as a workaround; a side benefit is that it saves a lot of time.

8.4 Classifying data with the boosting method

adabag implements both the AdaBoost.M1 and SAMME algorithms.

# boosting
set.seed(2)
churn.boost <- boosting(Species~., data = trainset, mfinal = 3,
                        coeflearn = "Freund", boos = FALSE,
                        control=rpart.control(maxdepth=3))
churn.boost.pred <- predict.boosting(churn.boost, newdata = testset)
churn.boost.pred$confusion
               Observed Class
Predicted Class setosa versicolor virginica
     setosa         18          0         0
     versicolor      0         20         1
     virginica       0          1        14
churn.boost.pred$error
[1] 0.03703704

The idea of boosting is to take weak classifiers (such as single decision trees) and gradually strengthen them by re-weighting the observations until they form a strong classifier. Bagging and boosting are both ensemble learning methods; the difference is that bagging combines independently built models while boosting learns iteratively. In the call above, mfinal is the number of iterations, coeflearn selects the weight-update scheme, boos controls whether each round draws a bootstrap sample using the observation weights, and rpart.control configures the single-tree base learner. A minimal sketch of the weight update follows; after it, as an extension, the same kind of boosting can be run through the caret interface with the ada method.
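
To show the weight update itself, here is a minimal AdaBoost-style sketch (mine, not adabag's implementation) on a two-class subset of iris, using depth-1 rpart stumps and Freund's update rule:

# Minimal AdaBoost-style weight update sketch on a two-class subset of iris.
library(rpart)
df <- droplevels(subset(iris, Species != "setosa"))   # two-class problem
y  <- df$Species
w  <- rep(1 / nrow(df), nrow(df))                     # start with uniform observation weights
for (m in 1:5) {
  stump <- rpart(Species ~ ., data = df, weights = w,
                 control = rpart.control(maxdepth = 1))
  pred  <- predict(stump, df, type = "class")
  err   <- sum(w * (pred != y)) / sum(w)              # weighted error of this round
  alpha <- log((1 - err) / err)                       # learner weight (Freund's scheme)
  w     <- w * exp(alpha * (pred != y))               # up-weight misclassified rows
  w     <- w / sum(w)
  cat(sprintf("round %d: weighted error %.3f, alpha %.2f\n", m, err, alpha))
}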

install.packages(c("mboost","ada"))
library(mboost)
library(pROC)
library(caret)
install.packages("MLmetrics")
set.seed(2)
ctrl <- trainControl(method = "repeatedcv", repeats = 1,
                     classProbs = TRUE, 
                     summaryFunction = twoClassSummary)
ada.train <- train(churn~.,data = trainset, method = "ada",
                   metric = "ROC", trControl = ctrl)
#  这里iris报错,切换回了churn数据集
  nu maxdepth iter       ROC      Sens        Spec      ROCSD     SensSD      SpecSD
1 0.1        1   50 0.8600045 0.9090204 0.010719176 0.03719839 0.05786791 0.007708342
...
plot(ada.train)
ada.predict <- predict(ada.train, testset, "prob")
ada.predict.result <- ifelse(ada.predict[1]>0.5, "yes", "no")
table(testset$churn, ada.predict.result)
     ada.predict.result
        no  yes
  yes   71  143
  no  1301    6

(One of the few figures in this chapter: the tuning plot produced by plot(ada.train).)

8.5 Performing cross-validation with the boosting method

churn.boostingcv <- boosting.cv(Species~., v=10, data = trainset,
            mfinal = 5, control = rpart.control(cp=0.01))
churn.boostingcv$confusion
               Observed Class
Predicted Class setosa versicolor virginica
     setosa         32          0         0
     versicolor      0         26         3
     virginica       0          3        32
churn.boostingcv$error
[1] 0.0625

8.6 Classifying data with the gradient boosting method

Gradient boosting also combines weak classifiers, but each new base learner is chosen to be maximally correlated with the negative gradient of the loss function. It can be used for both regression and classification and adapts well to very different data sets. A minimal hand-rolled sketch of the idea follows; the gbm package is then used on the churn data.
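
Before turning to gbm, a hand-rolled sketch of the idea (my code, not the book's): starting from the mean, each round fits a small regression tree to the current residuals, which for squared error are exactly the negative gradient, and adds a shrunken step of that tree to the model.

# Gradient-boosting-by-hand sketch for squared-error regression on a toy problem.
library(rpart)
set.seed(2)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)
d <- data.frame(x = x, y = y)
shrinkage <- 0.1
pred <- rep(mean(d$y), nrow(d))                  # start from the mean
for (m in 1:100) {
  d$resid <- d$y - pred                          # residuals = negative gradient of squared error
  tree <- rpart(resid ~ x, data = d, control = rpart.control(maxdepth = 2))
  pred <- pred + shrinkage * predict(tree, d)    # small step along the new base learner
}
mean((d$y - pred)^2)                             # training MSE after 100 rounds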

# gradient boosting
install.packages("gbm")
library(gbm)
# the response must be coded 0/1 for the Bernoulli loss, so convert it
trainset$churn <- ifelse(trainset$churn == "yes", 1, 0)
set.seed(2)
churn.gbm <- gbm(formula = churn ~ ., distribution = "bernoulli", data = trainset,
                 n.trees = 1000, interaction.depth = 7, shrinkage = 0.01,
                 cv.folds = 3)  # shrinkage = learning rate (step-size reduction); interaction.depth = maximum tree depth
summary(churn.gbm)
                                                       var    rel.inf
total_day_minutes                         total_day_minutes 29.8623601
total_eve_minutes                         total_eve_minutes 14.6407627
number_customer_service_calls number_customer_service_calls 12.5827527
total_intl_minutes                       total_intl_minutes  9.6529151
...
# use cross-validation to pick the best number of iterations
churn.iter <- gbm.perf(churn.gbm, method = "cv")
# predictions are on the log-odds scale of the Bernoulli loss
churn.predict <- predict(churn.gbm, testset, n.trees = churn.iter)
str(churn.predict)
num [1:1521] -3.56 -3.36 -2.99 -3.82 -3.52 ...
# ROC analysis: find the cutoff with the best trade-off
churn.roc <- roc(testset$churn, churn.predict)
plot(churn.roc)
# coords returns the best cutoff
coords(churn.roc, "best")
   threshold specificity sensitivity
1 -0.7369319   0.8738318   0.9869931
churn.predict.class <- ifelse(churn.predict > c(coords(churn.roc, "best")["threshold"]), 
                              "yes", "no")
table(testset$churn, churn.predict.class)
     churn.predict.class
        no  yes
  yes   27  187
  no  1290   17

The idea of the algorithm is as follows: at each stage the residual variance of each candidate split is computed and used to choose the best split; the selected base learner is then fitted with the residuals left over from the previous stage as its learning target, shrinking them further. This is gradient descent: the model keeps moving along the direction of the negative gradient so that the remaining residual variance is minimized. As an extension, the mboost package implements the same kind of model-based boosting:

library(mboost)
# mboost only supports numeric predictors, so drop or convert the non-numeric columns.
# Note: the earlier error seems to come from this yes/no conversion; it only worked after
# wrapping "yes" in c(), which makes no obvious sense, but it gets the job done.
trainset$churn <- ifelse(trainset$churn ==c("yes"),1,0)
trainset$voice_mail_plan = NULL
trainset$international_plan = NULL
churn.mboost <- mboost(churn ~., data = trainset, control = boost_control(mstop = 10))
summary(churn.mboost)
  Model-based Boosting

Call:
mboost(formula = churn ~ ., data = trainset, control = boost_control(mstop = 10))


  Squared Error (Regression) 

Loss function: (y - f)^2 
 

Number of boosting iterations: mstop = 10 
Step size:  0.1 
Offset:  0.1417074 
Number of baselearners:  14 

Selection frequencies:
            bbs(total_day_minutes) bbs(number_customer_service_calls) 
                               0.6                                0.4 
par(mfrow=c(1,2))
plot(churn.mboost)

(Figure: partial contributions of the important attributes, from plot(churn.mboost).)

8.7 Calculating the margins of a classifier

boost.margins <- margins(churn.boost, trainset)
boost.pred.margins <- margins(churn.boost.pred, testset)
plot(sort(boost.margins[[1]]), 
     (1:length(boost.margins[[1]]))/length(boost.margins[[1]]),
     type = 'l', xlim = c(-1,1), 
     main = "Boosting:Magrin cumulative distribution graph",
     xlab = "margin", ylab = "% observations", col= 'blue')
lines(sort(boost.pred.margins[[1]]), 
      (1:length(boost.pred.margins[[1]]))/length(boost.pred.margins[[1]]),
      type = "l", col="green")
abline(v=0, col='red', lty=2)

(Figure: margin cumulative distribution of the boosting classifier.)

# percentage of negative margins (misclassified observations) in the training set
boosting.training.margin <- table(boost.margins[[1]]>0)
boosting.negative.training <- as.numeric(boosting.training.margin[1])/boosting.training.margin[2]
boosting.negative.training
     TRUE 
0.0212766 
# calculate the margins of the bagging classifier
bagging.margins = margins(churn.bagging, trainset)
bagging.pred.margins <- margins(churn.predbagging,testset)
plot(sort(bagging.margins[[1]]),
     (1:length(bagging.margins[[1]]))/length(bagging.margins[[1]]),
     type = "l", xlim = c(-1,1), 
     main = "Bagging: Margin cumulative distribution graph",
     xlab = "margin", ylab = "% observations", col= 'blue')
lines(sort(bagging.pred.margins[[1]]), 
      (1:length(bagging.pred.margins[[1]]))/length(bagging.pred.margins[[1]]),
      type = "l", col="green")
abline(v=0, col='red', lty=2)
# the same percentage of negative margins, this time for bagging
bagging.training.margin <- table(bagging.margins[[1]]>0)
bagging.negative.training <- as.numeric(bagging.training.margin[1])/bagging.training.margin[2]
bagging.negative.training

The margin is a measure of how certain a classifier is; it is computed from the number of votes for the correct class and the largest number of votes for any incorrect class. Correctly classified samples build positive margins and misclassified samples produce negative margins; a margin close to 1 means the sample is classified correctly with very high confidence, while uncertain samples only have small margins. The margins function can compute the margins of AdaBoost.M1, AdaBoost-SAMME, and bagging classifiers and returns a margin vector, which can be plotted as a cumulative distribution curve to show how the margins are distributed; if every observation were classified correctly, the plot would be a vertical line at margin 1. Normally the proportion of negative margins (misclassified cases) on the training set is close to that on the test set. A toy computation of the margin idea follows.
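
A toy computation of the margin idea (my example, not adabag's internals): with 10 trees voting over 3 classes, the margin of each observation is the vote share of its true class minus the largest vote share of any other class.

# Hypothetical toy example of the margin of a voting ensemble.
votes <- matrix(c(9, 1, 0,    # 10 trees voting on 3 classes, one row per observation
                  4, 5, 1,
                  3, 3, 4), ncol = 3, byrow = TRUE)
true_class <- c(1, 1, 3)
total <- rowSums(votes)
margin <- sapply(seq_len(nrow(votes)), function(i) {
  correct <- votes[i, true_class[i]]
  wrong   <- max(votes[i, -true_class[i]])
  (correct - wrong) / total[i]
})
margin   # close to 1 = confident and correct, negative = misclassified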

8.8 Calculating the error evolution of the ensemble method

# error evolution (boosting)
boosting.evol.train <- errorevol(churn.boost, trainset)
boosting.evol.test <- errorevol(churn.boost, testset)
plot(boosting.evol.test$error, type = "l", ylim = c(0,1),
     main = "Boosting error versus number of trees",
     xlab = "Iteration", ylab = "Error", col='red',
     lwd=2)
lines(boosting.evol.train$error, cex = .5, col='blue',lty=2,
      lwd=2)
legend('topright', c('test','train'), col = c('red', 'blue'),
       lty = 1:2, lwd=2)

The adabag package provides the errorevol function so that the error of an ensemble classifier can be estimated as a function of the number of iterations.

# error evolution (bagging)
bagging.evol.train <- errorevol(churn.bagging, trainset)
bagging.evol.test <- errorevol(churn.bagging, testset)
plot(bagging.evol.test$error, type = "l", ylim = c(0,1),
     main = "Bagging error versus number of trees",
     xlab = "Iteration", ylab = "Error", col='red',
     lwd=2)
lines(bagging.evol.train$error, cex = .5, col='blue',lty=2,
      lwd=2)
legend('topright', c('test','train'), col = c('red', 'blue'),
       lty = 1:2, lwd=2)

The figures show how the classification error changes after each iteration; if the curves flatten out early, the ensembles can be pruned by calling predict.bagging and predict.boosting with a smaller number of iterations, as sketched below.
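
A minimal sketch of that pruning, reusing the churn.boost object from section 8.4; the newmfinal argument selects how many of the fitted iterations to keep (it cannot exceed the original mfinal, 3 here):

# Pruning sketch: keep only 2 of the 3 boosting iterations when predicting.
churn.boost.pruned <- predict.boosting(churn.boost, newdata = testset, newmfinal = 2)
churn.boost.pruned$error
# The same newmfinal argument exists for predict.bagging.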

8.9 Classifying data with random forests

Training produces many decision trees; each tree makes a prediction for a given input, and a voting mechanism selects the most frequent class as the final prediction.

# random Forest
install.packages("randomForest")
library(randomForest)
churn.rf <- randomForest(churn ~ ., data = trainset, importance = TRUE)  # also assess predictor importance
churn.rf
Call:
 randomForest(formula = churn ~ ., data = trainset, importance = T) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 4.31%
Confusion matrix:
    yes   no class.error
yes 363  130 0.263691684
no   20 2966 0.006697924
# class prediction
churn.prediction <- predict(churn.rf, testset)
table(churn.prediction, testset$churn)
churn.prediction  yes   no
             yes  167    7
             no    47 1300
plot(churn.rf)
importance(churn.rf)
                                     yes         no MeanDecreaseAccuracy
international_plan            93.0223581 72.5504101           95.5848053
voice_mail_plan               22.5321109 18.2760474           22.8558091
number_vmail_messages         23.0980210 17.5029154           22.6011108
total_day_minutes             33.4914749 33.8653396           43.0515228
varImpPlot(churn.rf)
margins.rf <- margin(churn.rf, trainset)
plot(margins.rf)
hist(margins.rf, main = "Margins of Random Forest for churn dataset")
boxplot(margins.rf~ trainset$churn, main = "Margins of Random Forest for churn dataset by class")

A random forest combines many weak learners (decision trees) into a strong learner. The procedure is very similar to bagging: each tree is grown on a bootstrap sample, and at each node only a random subset of predictors is considered when looking for the best split. For regression the final output is the (possibly weighted) average of all trees' predictions; for classification it is the majority class. The algorithm has two main parameters: ntree, the number of trees, and mtry, the number of features tried at each split (bagging uses only the former; if mtry equals the number of features in the training set, the random forest reduces to bagging, as in the sketch below). Its biggest advantages are that it is easy to compute, efficient, and fairly tolerant of missing or unbalanced data; its main drawbacks are that it cannot predict beyond the range of the training data and that it can overfit noisy data. As an extension, the cforest function in the party package also implements the random forest algorithm (shown after the sketch).
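
A small sketch of the mtry remark (mine, not the book's): fit one forest with mtry equal to the number of predictors, which behaves like bagging of trees, and one with the default mtry, assuming trainset is the churn training split with churn as a yes/no factor.

# Sketch of the mtry remark: mtry = number of predictors makes a random forest behave
# like bagging of trees; compare its OOB error with the default mtry.
p <- ncol(trainset) - 1                                    # number of predictor columns
rf.bagged  <- randomForest(churn ~ ., data = trainset, mtry = p, ntree = 500)
rf.default <- randomForest(churn ~ ., data = trainset, ntree = 500)  # default mtry = floor(sqrt(p))
c(bagging_like = rf.bagged$err.rate[500, "OOB"],
  random_forest = rf.default$err.rate[500, "OOB"])         # compare out-of-bag error rates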

# extension: the party package's cforest
install.packages("party")
library(party)
churn.cforest <- cforest(churn~., data = trainset, 
                         controls = cforest_unbiased(ntree=1000,mtry=5))
churn.forest.prediction <- predict(churn.cforest, testset, OOB=TRUE, type = "response")
table(churn.forest.prediction, trainset$churn)  # odd: test-set predictions tabulated against training labels (testset$churn was probably intended); flagged with a question mark for now
                       
churn.forest.prediction  yes   no
                    yes  348   21
                    no   145 2965

8.10 Estimating the prediction errors of different classifiers

Run 10-fold cross-validation on several classification algorithms with the errorest function to check whether the ensemble classifiers really outperform a single decision tree.

# ipred erroest
library(ipred)
churn.bagging <- errorest(churn ~., data = trainset, model = bagging);churn.bagging
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = bagging)

  10-fold cross-validation estimator of misclassification error 

Misclassification error:  0.052 
library(ada)
churn.mboosting <- errorest(churn ~., data = trainset, model = ada);churn.mboosting
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = ada)

  10-fold cross-validation estimator of misclassification error 

Misclassification error:  0.048 
churn.rf <- errorest(churn~., data = trainset, model = randomForest);churn.rf
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = randomForest)

  10-fold cross-validation estimator of misclassification error 

Misclassification error:  0.0454 
# rpart needs a wrapper so that errorest scores class predictions rather than probabilities
library(rpart)
churn.predict <- function(object, newdata) predict(object, newdata = newdata, type = "class")
churn.tree <- errorest(churn~., data = trainset, model = rpart, predict=churn.predict);churn.tree
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = rpart, 
    predict = churn.predict)

  10-fold cross-validation estimator of misclassification error 

Misclassification error:  0.0606 

randomForest has the lowest misclassification rate and the best performance, the single tree performs worst, and the ensemble learners clearly beat a single tree. The ada package provides the boosting classifier used above.
