# Machine Learning Notes - Decision Trees - Udacity

A Decision Tree maps inputs to discrete labels. At each node it asks a question about one attribute; different values of that attribute lead to different children, until a leaf yields the final result.

#### Using sklearn to create and train Decision Trees

Step-1: Decision Tree Classifier

```python
def classify(features_train, labels_train):

    ### your code goes here--should return a trained decision tree classifier
    from sklearn import tree

    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(features_train, labels_train)

    return clf
```
```python
#!/usr/bin/python

""" lecture and example code for decision tree unit """

import sys
from class_vis import prettyPicture, output_image
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
from classifyDT import classify

features_train, labels_train, features_test, labels_test = makeTerrainData()

### the classify() function in classifyDT is where the magic
### happens--fill in this function in the file 'classifyDT.py'!
clf = classify(features_train, labels_train)

#### grader code, do not modify below this line

prettyPicture(clf, features_test, labels_test)
```

The Decision Tree boundary is distinctive, like modern art, and it includes a few small islands. It also shows some overfitting.

Step-2: Accuracy

```python
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

########################## DECISION TREE #################################

from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train, labels_train)
labels_predict = clf.predict(features_test)

from sklearn.metrics import accuracy_score

### you fill this in!
### be sure to compute the accuracy on the test set
acc = accuracy_score(labels_test, labels_predict)

def submitAccuracies():
    return {"acc": round(acc, 3)}
```

Step-3: Next, look at which parameters can be tuned.

Resource: Parameters of Decision Tree http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

DecisionTreeClassifier has the following parameters:

`class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, class_weight=None, presort=False)`

```python
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

########################## DECISION TREE #################################

### your code goes here--now create 2 decision tree classifiers,
### one with min_samples_split=2 and one with min_samples_split=50
### compute the accuracies on the testing data and store
### the accuracy numbers to acc_min_samples_split_2 and
### acc_min_samples_split_50, respectively

from sklearn import tree
clf_2 = tree.DecisionTreeClassifier(min_samples_split=2)
clf_2 = clf_2.fit(features_train, labels_train)
labels_predict_2 = clf_2.predict(features_test)

clf_50 = tree.DecisionTreeClassifier(min_samples_split=50)
clf_50 = clf_50.fit(features_train, labels_train)
labels_predict_50 = clf_50.predict(features_test)

from sklearn.metrics import accuracy_score

acc_min_samples_split_2 = accuracy_score(labels_test, labels_predict_2)
acc_min_samples_split_50 = accuracy_score(labels_test, labels_predict_50)

def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}
```

`{"message": "{'acc_min_samples_split_50': 0.912, 'acc_min_samples_split_2': 0.908}"}`

#### Entropy is important: it determines how the Decision Tree splits the data.

Definition: measure of impurity of a bunch of examples.

Formula:

$$H = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the fraction of examples belonging to class $i$.

A Decision Tree chooses splits that maximize the Information Gain:

$$\text{Information Gain} = \text{Entropy(parent)} - [\text{weighted average}] \cdot \text{Entropy(children)}$$

Entropy of the parent here means the entropy of the labels before the split (the speed labels in the lecture example).
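As a minimal sketch, the entropy and information-gain formulas above can be computed by hand in Python. The labels below follow the lecture's speed example (two slow, two fast in the parent); the specific split into children is an illustrative assumption:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the class fractions p_i."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent_labels, children_labels):
    """Entropy of the parent minus the weighted average entropy of the children."""
    total = len(parent_labels)
    weighted = sum(len(child) / total * entropy(child)
                   for child in children_labels)
    return entropy(parent_labels) - weighted

parent = ["slow", "slow", "fast", "fast"]
# a split that separates one pure child from one mixed child
children = [["slow", "slow", "fast"], ["fast"]]

print(round(entropy(parent), 4))                     # 1.0 (maximally impure)
print(round(information_gain(parent, children), 4))  # 0.3113
```

Note that sklearn uses the Gini impurity by default; passing `criterion='entropy'` to `DecisionTreeClassifier` makes it split on information gain as computed here.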

Bias and Variance: a high-bias model pays too little attention to the training data (underfitting), while a high-variance model is overly sensitive to it (overfitting); a good model trades the two off.

Strengths and Weaknesses

#### Weaknesses:

Prone to overfitting: with many features, the tree can grow very complicated, so you need to tune parameters to stop the growth of the tree at the appropriate point.
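A minimal sketch of stopping tree growth, using synthetic data from `make_classification` as a stand-in since the course's `makeTerrainData` helper is not available here. The unconstrained tree fits the training set perfectly but tends to generalize worse than one whose growth is limited:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# unconstrained tree: grows until every leaf is pure (high variance)
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# growth stopped early: shallower tree, usually generalizes better
shallow = DecisionTreeClassifier(max_depth=4, min_samples_split=50,
                                 random_state=42).fit(X_train, y_train)

print("unconstrained  train/test:", deep.score(X_train, y_train),
      deep.score(X_test, y_test))
print("constrained    train/test:", shallow.score(X_train, y_train),
      shallow.score(X_test, y_test))
```

The gap between the unconstrained tree's perfect training accuracy and its lower test accuracy is the overfitting the notes warn about.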

#### Strengths:

Works well with ensemble methods: you can build a bigger, stronger classifier out of many decision trees.
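As one such ensemble, sklearn's `RandomForestClassifier` trains many decision trees on bootstrap samples and averages their votes. A minimal sketch on synthetic data (the terrain data from the lessons is not available here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# an ensemble of 100 decision trees, each fit on a bootstrap sample
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf = clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```

Averaging over many trees reduces the variance that makes a single deep tree overfit.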
