# Gradient Boosting in R (Part 1)

## Implementing stochastic gradient boosting with the gbm package

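The gbm package uses decision trees as its base learners. The most important parameters are:

• the form of the loss function (distribution)
• the number of iterations (n.trees)
• the learning rate (shrinkage)
• the subsampling fraction (bag.fraction)
• the depth of each decision tree (interaction.depth)

The loss function is easy to choose: binary classification problems generally use the bernoulli distribution, while regression problems can use the gaussian distribution. As for the learning rate, we all know that taking too long a stride invites a stumble, so a smaller learning rate is generally better. But if the steps are too small, more of them are needed: the number of training iterations has to grow for the model to reach its optimum, and training time and computing resources grow with it. The gbm author's rule of thumb is to set shrinkage between 0.01 and 0.001 and n.trees between 3000 and 10000.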
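The trade-off between shrinkage and n.trees can be seen in a toy version of the algorithm. The following is a minimal base R sketch, not gbm's actual implementation: it assumes squared-error loss instead of bernoulli, a single numeric predictor, and depth-1 trees (stumps). The helper names `fit_stump`, `predict_stump` and `boost` are made up for this illustration, and the data are simulated.

```r
# Toy stochastic gradient boosting for squared-error loss (base R only).
# Illustrative sketch: all function names here are hypothetical helpers.

# Fit a depth-1 tree (stump) to residuals r on predictor x by choosing
# the split point that minimises the within-node sum of squares.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  if (length(xs) < 2) {
    return(list(split = Inf, left = mean(r), right = mean(r)))
  }
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between distinct values
  sse <- vapply(cuts, function(cut) {
    l <- r[x <= cut]
    g <- r[x > cut]
    sum((l - mean(l))^2) + sum((g - mean(g))^2)
  }, numeric(1))
  best <- cuts[which.min(sse)]
  list(split = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

# Returns the training MSE after each iteration.
boost <- function(x, y, n_trees, shrinkage, bag_fraction = 0.5) {
  f <- rep(mean(y), length(y))  # start from a constant prediction
  train_mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    # "stochastic": each tree sees a random subsample drawn without replacement
    idx <- sample(length(y), size = floor(bag_fraction * length(y)))
    resid <- y - f  # residuals = negative gradient of squared-error loss
    stump <- fit_stump(x[idx], resid[idx])
    f <- f + shrinkage * predict_stump(stump, x)  # shrunken update
    train_mse[m] <- mean((y - f)^2)
  }
  train_mse
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

mse_fast      <- boost(x, y, n_trees = 100,  shrinkage = 0.1)
mse_slow      <- boost(x, y, n_trees = 100,  shrinkage = 0.01)
mse_slow_long <- boost(x, y, n_trees = 1000, shrinkage = 0.01)

# Compare the final training errors of the three runs
c(fast = tail(mse_fast, 1), slow = tail(mse_slow, 1),
  slow_long = tail(mse_slow_long, 1))
```

With the same 100 trees, the run at shrinkage 0.01 is left with a higher training error than the run at 0.1, and it takes many more trees at the small rate to bring the error down further. These are training errors only; the number of trees that is actually best for prediction should be chosen on held-out data, for example by cross-validation.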
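```r
setwd("E:\\Rwork")  # the author's local working directory; adjust or remove
# Install gbm if it is missing, then load it
if (!suppressWarnings(require('gbm')))
{
  install.packages('gbm')
  require('gbm')
}

# Load the package and the data (PimaIndiansDiabetes2 comes from mlbench)
library(gbm)
data(PimaIndiansDiabetes2, package = 'mlbench')
# Convert the response to 0/1 format (neg = 0, pos = 1)
data <- PimaIndiansDiabetes2
data$diabetes <- as.numeric(data$diabetes)
data <- transform(data, diabetes = diabetes - 1)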
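# Fit the model with gbm()
model <- gbm(diabetes ~ ., data = data, shrinkage = 0.01,
             distribution = 'bernoulli', cv.folds = 5,
             n.trees = 3000, verbose = FALSE)
# Use cross-validation to pick the best number of iterations
best.iter <- gbm.perf(model, method = 'cv')

# Relative importance of each predictor
# (n.trees must be named: the second positional argument of summary.gbm is cBars)
summary(model, n.trees = best.iter)

# Marginal effect of the first predictor
plot(model, i.var = 1, n.trees = best.iter)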
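# Fit the same kind of model through caret, evaluated with 5-fold cross-validation
library(caret)
# Start again from the original data (factor response) and drop incomplete rows
data <- na.omit(PimaIndiansDiabetes2)
fitControl <- trainControl(method = "cv",
                           number = 5,
                           returnResamp = "all")
model2 <- train(diabetes ~ .,
                data = data, method = 'gbm',
                distribution = 'bernoulli',
                trControl = fitControl,
                verbose = FALSE,
                tuneGrid = data.frame(n.trees = 1200, shrinkage = 0.01,
                                      interaction.depth = 1, n.minobsinnode = 10))
model2
```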
```
Stochastic Gradient Boosting

392 samples
8 predictor
2 classes: 'neg', 'pos'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 314, 314, 313, 314, 313
Resampling results:

Accuracy   Kappa
0.7780915  0.4762955

Tuning parameter 'n.trees' was held constant at a value of 1200
Tuning parameter 'interaction.depth' was held constant at a value of 1
Tuning parameter 'shrinkage' was held constant at a value of 0.01
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
```
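To read the Accuracy and Kappa columns in the output above, it may help to see how the two statistics are computed from a confusion matrix. The sketch below uses base R and a hypothetical 2x2 table; the counts are invented for illustration and are not the results of the model. caret computes these statistics on each held-out fold and reports their average.

```r
# Hypothetical confusion matrix (rows: predicted class, columns: observed class).
# The counts are invented for illustration only.
cm <- matrix(c(50, 10,
                5, 35),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("neg", "pos"),
                             observed  = c("neg", "pos")))

n <- sum(cm)

# Accuracy: share of cases on the diagonal (observed agreement)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled so that 1 is perfect agreement
kappa_stat <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, kappa = kappa_stat)
```

Kappa is lower than accuracy whenever part of the agreement could be produced by chance alone, which is why it is a useful companion to accuracy when the classes are unbalanced, as they are with the 'neg' and 'pos' classes here.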

