Chapter 6 Summary - Decision Tree - 第六章总结 决策树
In this chapter summary, I will briefly introduce how Decision Tree in Scikit-Learn works, as well as measurement of impurity.
Table of Content 内容提要
Cost function 代价函数
Like SVMs (introduced in Chapter 5 Summary), Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks. They are very powerful algorithms, capable of fitting complex datasets.
What is a Decision Tree 决策树是什么
What is a decision tree? Decision Trees are tree-like models, where each node contain a subset of the datasets and, if it is a nonleaf node, makes decisions on how the subset can be further split into more branches.
In the chart above, there are 150 samples in the top-level node. These samples are split into two subsets based on whether theirpental length.
Classification and Regression Tree (CART) 分类和回归树
There are many algorithms for Decision Trees. Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have two children. As you can tell from the name, the CART can be applied to both classification and regression problems. You can probably tell that there are also algorithms produces more than just two children; one example is ID3.
At nonleaf nodes, Decision Trees make a decision on how to further split the dataset based on certain criteria. How are these criteria decided? Intuitively, in a classification problem, we wish a split could separate instances of different classes as distinct as possible. In the best case, one subset will contain instances from one class, the other subset from the other class. In another word, we hope the subset aspureas possible. And the best split should give the least impurity.
Now we need some measurements of impurity. Here are two common ones.
is for nodes, subscript
for classes. For example, in the decision tree shown just now, at the top-level node,
It is easy to tell that these two metric changes almost the same way when
changes. So they are almost equivalent.
A reduction of entropy is often called aninformation gain.
Cost function for Classification 分类问题的代价函数
To minimize the impurity, we can subsequently define the cost function of each split as 以着最小化不纯度的标准，定义代价函数如下
'Cost function for classification'
Now we know how to split at a single node. Iterating this process on all the produced children node generates a decision tree.
When do we stop splitting? If we only stop when there is no more impurity, we are obviously just remembering the mapping, which is an extreme case of overfitting. To regularize, there are various hyperparameters to put an early stop to the decision tree growing. Common hyperparameters are , , etc. And again, cross-validation is our good friend on checking overfitting.
A Decision Tree can also estimate the probability that an instance belongs to a particular class k: first, it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in this node.
We have previously mentioned that CART can also be used for regression problem. The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.
'Cost function for regression'
A single decision tree model suffers a few issues. For example, the split can only be orthogonal to the axes, which makes it suffer from dataset rotation. Besides, Decision Trees are sensitive to small variations in the training data.
How to overcome these issues? You will see it in the next summary, the Ensemble methods.
This article is part of a series of summaries of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. The summaries are meant to explain machine learning concepts and ideas, instead of covering the maths behind the models.
本文是《Hands-On Machine Learning with Scikit-Learn and TensorFlow》这本书的总结随笔系列的一部分。总结旨在解释机器学习的观念和想法，而不是数学和模型