# Machine Learning in Action: Decision Trees

### How the Decision Tree Algorithm Works

##### Entropy

Entropy measures the impurity of a set of class labels. For the classic 14-sample play-tennis dataset (9 "play", 5 "don't play"):

```python
-(5/14 * log(5/14, 2) + 9/14 * log(9/14, 2))   # ≈ 0.940
```
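The same quantity can be written as a small helper over raw class counts; `entropy` here is an illustrative function, not part of the book's code:

```python
from math import log

def entropy(counts):
    """Shannon entropy, in bits, of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total, 2) for c in counts)

h = entropy([5, 9])   # the 14-sample play/don't-play split, ≈ 0.940
```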
##### Information Gain

• Sunny: 2/5 of the samples play and 3/5 don't, so the entropy is:

```python
-(2/5 * log(2/5, 2) + 3/5 * log(3/5, 2))   # ≈ 0.971
```

• Overcast: entropy 0 (every overcast sample plays)
• Rainy: entropy 0.971
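Weighting each branch's entropy by its share of the 14 samples gives the information gain of splitting on the weather feature. A worked check, assuming the usual 5 sunny / 4 overcast / 5 rainy split of the play-tennis data:

```python
from math import log

def H(pos, neg):
    """Entropy of a two-class split; empty classes contribute 0 by convention."""
    total = pos + neg
    return -sum(p / total * log(p / total, 2) for p in (pos, neg) if p)

base = H(9, 5)                                            # before the split, ≈ 0.940
after = 5/14 * H(2, 3) + 4/14 * H(4, 0) + 5/14 * H(3, 2)  # weighted branch entropies
gain = base - after                                       # ≈ 0.247
```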

##### Pseudocode

```
def createBranch():
    if all class labels in the dataset are identical:
        return that class label
    else:
        find the best feature to split on (the split whose resulting
        entropy is lowest, i.e. whose information gain is highest)
        split the dataset
        create a branch node
        for each subset of the split:
            call createBranch() recursively and add the result
            to the branch node
        return the branch node
```

### Classifying Marine Animals with a Decision Tree

##### Problem and Data

• Whether it can survive in the water
• Whether it has flippers

```python
def createDataSet():
    # each row: [survives in water, has flippers, is it a fish?]
    dataSet = [
        [1, 1, 'yes'],
        [1, 1, 'yes'],
        [1, 0, 'no'],
        [0, 1, 'no'],
        [0, 1, 'no'],
    ]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
```

##### Computing Entropy

```python
from math import log

def calcshannon(dataSet):
    """Shannon entropy of the class labels (last column) in dataSet."""
    num = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannon = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / num
        shannon -= prob * log(prob, 2)
    return shannon
```
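As a sanity check, the marine dataset has 2 'yes' and 3 'no' labels, so its entropy should match the 0.971 computed in the weather example. A standalone recomputation (it does not reuse the function above, so it runs on its own):

```python
from math import log

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels = [row[-1] for row in data]
# probability of each distinct class label
probs = [labels.count(l) / len(labels) for l in set(labels)]
h = -sum(p * log(p, 2) for p in probs)   # ≈ 0.971
```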

##### Splitting the Dataset

```python
def splitDataSet(dataSet, axis, value):
    """Return the rows whose feature `axis` equals `value`, with that column removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedVec = featVec[:axis]
            reducedVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedVec)
    return retDataSet
```
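The effect is the same as a list comprehension that keeps the matching rows and drops the tested column. For example, splitting the marine data on feature 0 with value 1:

```python
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
axis, value = 0, 1
# keep rows where feature `axis` == value, with that column removed
subset = [row[:axis] + row[axis + 1:] for row in data if row[axis] == value]
# subset == [[1, 'yes'], [1, 'yes'], [0, 'no']]
```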
```python
def choose(dataSet):
    """Pick the index of the feature with the highest information gain."""
    numfeature = len(dataSet[0]) - 1   # last column is the class label
    baseEntropy = calcshannon(dataSet)
    bestinfogain = 0.0
    bestfeature = -1
    for i in range(numfeature):
        vals = set(example[i] for example in dataSet)
        newEntropy = 0.0
        for value in vals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcshannon(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestinfogain:
            bestinfogain = infoGain
            bestfeature = i
    return bestfeature
```
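On the marine dataset the two candidate splits can be checked by hand: splitting on 'no surfacing' gives a gain of about 0.420, splitting on 'flippers' about 0.171, so `choose` should return 0. A standalone recomputation (not reusing the functions above):

```python
from math import log

def H(labels):
    """Shannon entropy of a list of class labels."""
    return -sum(labels.count(l) / len(labels) * log(labels.count(l) / len(labels), 2)
                for l in set(labels))

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
base = H([row[-1] for row in data])
gains = []
for i in range(2):
    newE = 0.0
    for v in set(row[i] for row in data):
        sub = [row[-1] for row in data if row[i] == v]
        newE += len(sub) / len(data) * H(sub)   # weight branch entropy by its size
    gains.append(base - newE)
# gains ≈ [0.420, 0.171] → feature 0 ('no surfacing') wins
```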
##### Building the Tree

```python
import operator

def majority(classList):
    """Most common class label; used when all features have been consumed."""
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedcount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedcount[0][0]
```

```python
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # stop if every sample has the same label
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop if no features remain: fall back to a majority vote
    if len(dataSet[0]) == 1:
        return majority(classList)
    bestFeat = choose(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    uvals = set(example[bestFeat] for example in dataSet)
    for value in uvals:
        sublabels = labels[:]   # copy so recursion doesn't mutate the caller's list
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), sublabels)
    return myTree
```

• Pros: easy to interpret
• Cons: prone to overfitting
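For the marine data, the code above builds the nested dict `{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}`. Classifying a new sample is then a walk from the root to a leaf; `classify` below is a hypothetical helper, not part of the original:

```python
def classify(tree, featLabels, testVec):
    # the root key names the feature to test at this node
    root = next(iter(tree))
    featIndex = featLabels.index(root)
    subtree = tree[root][testVec[featIndex]]
    # recurse until we reach a leaf (a plain class-label string)
    return classify(subtree, featLabels, testVec) if isinstance(subtree, dict) else subtree

myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
featLabels = ['no surfacing', 'flippers']
result = classify(myTree, featLabels, [1, 1])   # 'yes'
```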
