统计学习导论 Chapter8 -- Tree-Based Methods

用户1148525

发布于 2018-01-03 16:18:16

1.3K0

发布于 2018-01-03 16:18:16

Book: An Introduction to Statistical Learning with Applications in R http://www-bcf.usc.edu/~gareth/ISL/

本章介绍用于回归和分类的基于树的方法。这些方法将预测空间切分为一组简单的区域。为了对一个给定观测值进行预测，通常的做法是使用其所属的区域内所有训练样本的 mean or mode。因为切分预测空间的切分规则可以总结为一个树，这些方法被称之为 decision tree 方法。 Tree-based methods 从方法可解释性（interpretation）的角度来说是简单有用的。但是和最先进的有监督算法相比较，性能要差一些。所以这里我们也介绍了 bagging, random forests, and boosting 等方法，这些方法涉及生成多个树相结合用于产生一个 consensus prediction（少数服从多数的投票）。我们可以看到将大量树组合起来可以提升预测的精度，但是牺牲了方法的可解释性interpretation。

8.1 The Basics of Decision Trees 决策树可以用于回归和分类问题。这里我们首先来看看回归问题，再讨论分类问题。 8.1.1 Regression Trees 为了深入展开回归树的讨论，我们从一个简单的例子开始。

Predicting Baseball Players’ Salaries Using Regression Trees 使用回归树预测棒球运动员的薪水这里我们有一个相关数据库： Hitters data set，这里运动员的薪水主要受两个因素影响：Years（在主要联盟中打球多少年）和 Hits（上一年的 hits 数目）。这里我们将不完整的数据剔除了。

上图对应的 three-region partition

Prediction via Stratification of the Feature Space 基于特征空间划分的预测一个回归树的构建简单的来说可以分为以下两个步骤： 1）我们将预测空间切分为 J 个 distinct and non-overlapping 区域 2）对于落入同一个区域的观测值，我们给出的预测值是一样的。该预测值就是该区域所有的训练样本的 the mean of the response values

接下来我们探讨如何对预测空间进行切分得到 J 个 distinct and non-overlapping 区域？理论上这些区域可以是任何形状，但是为了便于解释预测模型的结果，我们选择将预测空间切分为高纬矩形或boxes，我们的目标是找到 J个boxes 最小化 RSS

考虑到计算的可行性，我们不能使用穷举的方法来找到最优的 boxes。我们采取的方法是一个 top-down, greedy 称之为 recursive binary splitting 这个方法是简单的。 The approach is top-down because it begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree. It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

Tree Pruning 上面的预测空间切分方法对于训练数据可以产生很好的结果，但是容易过拟合，对测试数据的效果比较差。这是因为生成的树可以过于复杂。一个小一点的树可以会得到 lower variance and better interpretation at the cost of a little bias。怎么解决这个问题了？一个比较好的策略就是首先生成一个很大的树，然后再对该树进行裁剪得到一个子树。裁剪的标准是 the lowest test error rate。给定一个 subtree，我们可以使用 cross-validation or the validation set 来估计 test error 。对每个可能的 subtree 进行 cross-validation error 估计太繁琐了，因为可能的 subtree 数量很大。所以我们需要一个方法来选择少数的 subtrees 来考虑。

Cost complexity pruning — also known as weakest link pruning 这个算法就可以帮我们解决上面的问题。不用考虑每个可能的 subtree，我们 consider a sequence of trees indexed by a nonnegative tuning parameter α。整个算法流程如下表所示：

8.1.2 Classification Trees 分类树和回归树很相似，除了预测的值是定性响应之外。对于回归树，我们对某一观测值的预测值是其对应的区域内所有训练样本的响应均值。对于分类树，我们对某一观测值的预测类别是其对应区域内训练样本中类别最多的类。 for a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs。分类树的生成和回归树的生成是类似的，有一些细节上的差异。in the classification，RSS cannot be used as a criterion for making the binary splits

classification error rate： However, it turns out that classification error is not sufficiently sensitive for tree-growing in practice two other measures are preferable Gini index 和 cross-entropy

8.1.3 Trees Versus Linear Models 线性模型和树的对比线性回归假定一个模型具有一个线性表达式

regression trees 假定一个模型具有如下形式：

上面两个模型哪个更好了？这个依赖于具体问题。如果特征和响应的关系可以用一个线性模型很好的表示，那么线性回归将会工作的更好。如果there is a highly nonlinear and complex relationship between the features and the response as indicated by model (8.9) 回归树则更适合处理这个问题

The relative performances of tree-based and classical approaches can be assessed by estimating the test error, using either cross-validation or the validation set approach

8.1.4 Advantages and Disadvantages of Trees 树的优点： 1） Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression! 树比较容易解释 2） Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters. 有人认为决策树和人的决策行为更接近 3） Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small). 树可以用图表展示，可以被大众解释清楚 4） Trees can easily handle qualitative predictors without the need to create dummy variables. 树不需要变量就可以解决定性预测

树的缺点： 5） Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book. 有时树的预测精度要差一些。

通过组合更好的决策树来提升树的预测性能, 组合方法 bagging, random forests, and boosting

8.2 Bagging, Random Forests, Boosting 8.2.1 Bagging bootstrap 是一个很好的idea，这里我们用它来提升决策树的性能。我们知道决策树有一个问题是 high variance，这就是说如果我们随机的将训练样本分为两个部分，我们分别用这两部分的数据拟合一个决策树，那么得到的两个决策树可能很不一样。如果一个方法是 low variance ，那么这两个决策树应该是类似的。linear regression 倾向于具有 low variance，如果 n 对 p 的比率足够的大。Bootstrap aggregation, or bagging 在统计学习中通常用于降低方差。这里我们介绍它是因为它再决策树中很重要，经常被使用。

给定一组 n 个独立的观测值 Z 1 ,…,Z n , 每个观测值的方差是 σ 2，那么这组观测值均值的方差就是 σ 2/n。也就是说 averaging a set of observations reduces variance。所以统计学习中通过降低方差提升统计学习方法的准确度一个很自然的方法就是从分布中提取多个训练数据集，对每个数据集建立一个独立的预测模型，最后将这些预测结果平均起来。

当然这通常是不现实的，因为我们没有多个训练数据集。但是我们可以 bootstrap，从一个训练数据集中重复采样得到 B different bootstrapped training data sets

尽管对于很多回归方法 bagging 可以提升其预测性能，它对于决策树特别有用。将 bagging 应用于回归树，我们只需通过 B bootstrapped training sets 来构建 B regression trees，然后我们将这些预测结果平均。这些树是 grown deep，没有裁剪，所以每个决策树都是高方差high variance 低偏差 low bias，将 B 个树平均起来可以降低方差。 bagging 通过将成百上千个树平均起来得到的结果性能改善很大。

上面讨论的是将 bagging 应用于回归问题 regression context，如何将 bagging 应用于分类问题了？有几种可能的策略，最简单的方式如下：给定一个观测值，我们使用 B 个树分别预测其类别，然后通过投票得到最终的输出类别 a majority vote

Out-of-Bag Error Estimation 如何对一个 bagged 模型进行 test error 估计了？其实这很简单，不需要使用 cross-validation or the validation set 方法。我们知道 bagging 关键在于对一个训练数据集的多个 bootstrapped subsets 分别拟合树。我们可以发现，总体来说每个 bagged tree 使用了整个训练数据集的 2/3样本，剩下的没有用于拟合一个给定的 bagged tree 1/3 样本称之为 out-of-bag (OOB) observations

We can predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the ith observation. In order to obtain a single prediction for the ith observation, we can average these predicted responses (if regression is the goal) or can take a majority vote (if classification is the goal). This leads to a single OOB prediction for the ith observation.

It can be shown that with B sufficiently large, OOB error is virtually equivalent to leave-one-out cross-validation error. The OOB approach for estimating the test error is particularly convenient when performing bagging on large data sets for which cross-validation would be computationally onerous.

Variable Importance Measures bagging 提升模型性能的代价就是牺牲了模型的可解释性 bagging improves prediction accuracy at the expense of interpretability.

尽管对 bagged trees 进行解析比较困难，但是我们可以分析一下单个预测器的重要性 Although the collection of bagged trees is much more difficult to interpret than a single tree, one can obtain an overall summary of the importance of each predictor using the RSS (for bagging regression trees) or the Gini index (for bagging classification trees).

8.2.2 Random Forests 随机森林是对 bagged trees 的一种改进，主要是降低多个树的相关性。在 bagging 中我们对 bootstrapped training samples 建立多个决策树，在建立这些决策树的过程中，和 bagging 不同之处就是：每当我们需要做切分的时候 split，我们会从 p predictors 随机挑出 m predictors，这个 split 只能从这 m predictors 中挑出一个。通常我们选择m ≈ p的平方根

换句话说，在建立一个随机森林时，在树的每个 split 中，我们不是选择概率最大的预测器（the algorithm is not even allowed to consider a majority of the available predictors.）这看上去很疯狂，但是确实一个 clever rationale。假定在数据集上有一个很强的预测器，和一组其他相对较强的预测器。那么在 bagged trees，大部分树将使用这个最强的预测器。这就导致大部分 bagged trees 看上去是一样的。那么这些 bagged trees 做出的预测就具有很强的相关性，Hence the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting.

Random forests overcome this problem by forcing each split to consider only a subset of the predictors.

The main difference between bagging and random forests is the choice of predictor subset size m. For instance, if a random forest is built using m = p, then this amounts simply to bagging

换一种说法就是：在上面的tree bagging基础上，每一个model训练的时候对所选取的data split 随机选取一部分特征子集（random subset of the features）来训练

8.2.3 Boosting 首先回顾一下 bagging，先对原始数据集进行 bootstrap，得到多个 bootstrap data sets，对每个 bootstrap data set 建立一个独立的决策树。最后再平均所有决策树的输出。

Boosting 中的树是 sequentially 增长的，每个树的建立都使用以前建立的树的信息。each tree is fit on a modified version of the original data set

In general, statistical learning approaches that learn slowly tend to perform well. the boosting approach instead learns slowly

Boosting 中有三个可调参数： 1）The number of trees B，boosting can overfit if B is too large，这种过拟合是比较缓慢的。我们可以使用 cross-validation to select B 2） The shrinkage parameter λ, a small positive number， This controls the rate at which boosting learns. Typical values are 0.01 or 0.001 3） The number d of splits in each tree, which controls the complexity of the boosted ensemble。 Often d = 1 works well, in which case each tree is a stump, consisting of a single split. In this case, the boosted ensemble is fitting an additive model

本文参与腾讯云自媒体同步曝光计划，分享自作者个人站点/博客。

原始发表：2017年12月07日，如有侵权请联系 cloudcommunity@tencent.com 删除

决策树