[Machine Learning] Building a Decision Tree with Python 3

周小董

Published 2019-03-25 11:20:01

This article is part of the column: python前行者

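1. What is a decision tree?

A decision tree is a flowchart-like tree structure: each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class or a class distribution. The topmost node of the tree is the root node.

The algorithm used to build the tree here is ID3. Its main idea is to use the information entropy of each feature to pick the best attribute to split on: the attribute with the largest information gain is chosen. (scikit-learn's DecisionTreeClassifier, used below, implements an optimized version of CART; with criterion='entropy' it selects splits by the same entropy-based measure.)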
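
To make the entropy idea concrete, here is a minimal sketch that computes the entropy of the class column of the table used in this article (9 "yes" and 5 "no"). The helper name entropy is an assumption made for illustration; it is not part of the original script.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels (hypothetical helper)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


# class_buys_computer column of the 14-row table: 9 "yes", 5 "no"
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(round(entropy(labels), 3))       # about 0.94 bits
print(entropy(["yes", "no"]))          # 1.0: an even split is maximally uncertain
```

The closer a set is to a single class, the lower its entropy, which is why a split that produces purer subsets is preferred.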

The CSV file

RID,age,income,student,credit_rating,class_buys_computer
1,youth,high,no,fair,no
2,youth,high,no,excellent,no
3,middle_aged,high,no,fair,yes
4,senior,medium,no,fair,yes
5,senior,low,yes,fair,yes
6,senior,low,yes,excellent,no
7,middle_aged,low,yes,excellent,yes
8,youth,medium,no,fair,no
9,youth,low,yes,fair,yes
10,senior,medium,yes,fair,yes
11,youth,medium,yes,excellent,yes
12,middle_aged,medium,no,excellent,yes
13,middle_aged,high,yes,fair,yes
14,senior,medium,no,excellent,no

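Steps:

1. Save the table above as a .csv file, then read the feature list and the class list from it.
2. Convert the feature list and the class list to 0/1 form.
3. Classify with a decision tree.
4. Visualize the model.
5. Run a prediction on a test row.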
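
Step 2 can be pictured with a small pure-Python sketch of one-hot encoding. It only mimics, under simplifying assumptions, what DictVectorizer does for string-valued features (one column per "feature=value" pair, in sorted order); the function name one_hot and the two sample rows are made up for illustration.

```python
def one_hot(rows):
    """Encode a list of {feature: value} dicts as 0/1 vectors (illustrative only)."""
    # One column per distinct "feature=value" pair, in sorted order.
    names = sorted({f"{k}={v}" for row in rows for k, v in row.items()})
    index = {name: i for i, name in enumerate(names)}
    vectors = []
    for row in rows:
        vec = [0] * len(names)
        for k, v in row.items():
            vec[index[f"{k}={v}"]] = 1  # switch on the column for this value
        vectors.append(vec)
    return names, vectors


rows = [
    {"age": "youth", "student": "no"},
    {"age": "senior", "student": "yes"},
]
names, vectors = one_hot(rows)
print(names)    # ['age=senior', 'age=youth', 'student=no', 'student=yes']
print(vectors)  # [[0, 1, 1, 0], [1, 0, 0, 1]]
```

Each row ends up with exactly one 1 per original feature, which is the shape of the dummyX matrix printed later in the article.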

Data analysis

# Read in the csv file and put features into list of dict and list of class label
# Note: in Python 3, csv.reader needs a text-mode file; opening with 'rb' raises an error
import csv

allElectronicsData = open(r'./AllElectronics.csv', 'r')
reader = csv.reader(allElectronicsData)
headers = next(reader)  # read the first line of the file (the header row)
print(headers)

featureList = []
labelList = []
for row in reader:  # continue with the remaining rows
    labelList.append(row[-1])  # class label: the last field of each row
    rowDict = {}
    for i in range(1, len(row)-1):  # skip column 0 (RID) and the label column
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)
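
As an alternative sketch of the same parsing step, csv.DictReader pairs each field with its header automatically. This is not the author's code: io.StringIO stands in for AllElectronics.csv so the example runs without the file, and only two rows of the table are used.

```python
import csv
import io

# Two rows of the table above; io.StringIO is a stand-in for the real file.
sample = io.StringIO(
    "RID,age,income,student,credit_rating,class_buys_computer\n"
    "1,youth,high,no,fair,no\n"
    "3,middle_aged,high,no,fair,yes\n"
)

featureList, labelList = [], []
for row in csv.DictReader(sample):
    row.pop("RID")                                    # the row id is not a feature
    labelList.append(row.pop("class_buys_computer"))  # last column is the class
    featureList.append(dict(row))                     # what remains are the features

print(featureList[0])
print(labelList)  # ['no', 'yes']
```

The resulting featureList and labelList have the same structure as those built by the index-based loop above.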

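Note 1: when using csv.reader in Python 3, opening the file with mode 'rb' raises an error; for details see http://blog.csdn.net/darlingwood2013/article/details/70858086

Note 2: in Python 2 this call was written reader.next(). After next(reader), the reader points at the next line, so in the following for loop row takes the second through the last line in turn, and the first line never ends up in labelList. To go back to the beginning of the file, call allElectronicsData.seek(0).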

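The seek() method moves the file's read pointer to a given position. Syntax: fileObject.seek(offset[, whence]). Parameters: offset is the offset, i.e. the number of bytes to move. whence is optional and defaults to 0; it says where the offset is measured from: 0 means from the start of the file, 1 from the current position, and 2 from the end of the file. In Python 3, seek() returns the new absolute position (in Python 2 it returned nothing).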
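
The rewind described in Note 2 can be sketched as follows. io.StringIO is used here as a stand-in for the opened CSV file, and the two data rows are a made-up excerpt.

```python
import csv
import io

data = io.StringIO("RID,age\n1,youth\n2,senior\n")  # stand-in for the open file
reader = csv.reader(data)

headers = next(reader)          # consumes the header line
rows = [row for row in reader]  # only the remaining lines are left
print(headers, rows)

position = data.seek(0)         # rewind; returns the new position
reader = csv.reader(data)       # start a fresh reader on the rewound file
first_again = next(reader)
print(position, first_again)    # 0 ['RID', 'age']
```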

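The next() method: file objects in Python 3 do not have a next() method. Instead, the built-in function next() calls the iterator's __next__() method and returns the next item. A for loop does this on every iteration; for a file, each call returns the next line, and StopIteration is raised at end of file (EOF). Syntax: next(iterator[, default]). Parameters: iterator is the object to advance; default, if given, is returned instead of raising StopIteration once the iterator is exhausted. Return value: the next item (for a file, the next line).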
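
A short sketch of how the built-in next() behaves at the end of an iterator, with and without the default argument. The list of strings is a made-up stand-in for the lines of a file.

```python
lines = iter(["header line", "data line"])  # stand-in for an open file

first = next(lines)           # "header line"
second = next(lines)          # "data line"
fallback = next(lines, None)  # exhausted: the default is returned instead of raising

try:
    next(lines)               # no default: StopIteration is raised
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, fallback, exhausted)
```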

Source code:

# -*- coding:utf-8 -*-
import csv

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

# Read in the csv file and put features into list of dict and list of class label
featureList = []
labelList = []
with open(r'./AllElectronics.csv', 'r', newline='') as allElectronicsData:
    reader = csv.reader(allElectronicsData)
    headers = next(reader)
    print(headers)
    for row in reader:
        labelList.append(row[-1])
        rowDict = {}
        for i in range(1, len(row)-1):
            rowDict[headers[i]] = row[i]
        featureList.append(rowDict)

print(featureList)

# Vectorize features: turn the list of feature dicts into one-hot (dummy) variables
vec = DictVectorizer()  # instantiate
dummyX = vec.fit_transform(featureList).toarray()

# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
featureNames = vec.get_feature_names_out().tolist()

print("dummyX: " + str(dummyX))
print(featureNames)

print("labelList: " + str(labelList))

# Vectorize class labels: turn the class column into dummyY
lb = preprocessing.LabelBinarizer()  # instantiate
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))

# Using decision tree for classification
# clf = tree.DecisionTreeClassifier()
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)  # fit the classifier on the training data
print("clf: " + str(clf))


# Visualize model: export the tree in Graphviz .dot format
with open("./allElectronicInformationGainOri.dot", 'w') as f:
    tree.export_graphviz(clf, feature_names=featureNames, out_file=f)

# Test row
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))

newRowX = oneRowX.copy()  # copy so that dummyX[0] is not modified in place
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))

predictedY = clf.predict([newRowX])  # predict with the trained classifier
print("predictedY: " + str(predictedY))

Output:

['RID', 'age', 'income', 'student', 'credit_rating', 'class_buys_computer']
[{'age': 'youth', 'income': 'high', 'student': 'no', 'credit_rating': 'fair'}, {'age': 'youth', 'income': 'high', 'student': 'no', 'credit_rating': 'excellent'}, {'age': 'middle_aged', 'income': 'high', 'student': 'no', 'credit_rating': 'fair'}, {'age': 'senior', 'income': 'medium', 'student': 'no', 'credit_rating': 'fair'}, {'age': 'senior', 'income': 'low', 'student': 'yes', 'credit_rating': 'fair'}, {'age': 'senior', 'income': 'low', 'student': 'yes', 'credit_rating': 'excellent'}, {'age': 'middle_aged', 'income': 'low', 'student': 'yes', 'credit_rating': 'excellent'}, {'age': 'youth', 'income': 'medium', 'student': 'no', 'credit_rating': 'fair'}, {'age': 'youth', 'income': 'low', 'student': 'yes', 'credit_rating': 'fair'}, {'age': 'senior', 'income': 'medium', 'student': 'yes', 'credit_rating': 'fair'}, {'age': 'youth', 'income': 'medium', 'student': 'yes', 'credit_rating': 'excellent'}, {'age': 'middle_aged', 'income': 'medium', 'student': 'no', 'credit_rating': 'excellent'}, {'age': 'middle_aged', 'income': 'high', 'student': 'yes', 'credit_rating': 'fair'}, {'age': 'senior', 'income': 'medium', 'student': 'no', 'credit_rating': 'excellent'}]
dummyX: [[0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
 [0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
 [1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
 [0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
 [0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
 [0. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
 [0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
 [0. 0. 1. 1. 0. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
 [1. 0. 0. 0. 1. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
labelList: ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
dummyY: [[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]]
clf: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
oneRowX: [0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
newRowX: [1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
predictedY: [1]

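As the output shows, the classifier produces a prediction for the new input row.

In addition, Graphviz makes it easy to convert the .dot file generated by the program into a PDF that shows what the decision tree looks like. In a terminal, run: dot -Tpdf name.dot -o name1.pdf. The decision tree generated by this program is shown in the figure below: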

Install Graphviz: http://www.graphviz.org/

Configure the environment variables

Convert the .dot file to PDF to visualize the decision tree: dot -Tpdf iris.dot -o output.pdf
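
The same conversion can also be launched from Python. The sketch below only wraps the dot command shown above with subprocess; the helper name dot_to_pdf_command is hypothetical, and the code assumes that Graphviz is installed and that the training script has already written the .dot file.

```python
import os
import shutil
import subprocess


def dot_to_pdf_command(dot_path, pdf_path):
    """Build the argument list for: dot -Tpdf <dot_path> -o <pdf_path>."""
    return ["dot", "-Tpdf", dot_path, "-o", pdf_path]


dot_file = "allElectronicInformationGainOri.dot"  # file written by the script above
cmd = dot_to_pdf_command(dot_file, "output.pdf")

if shutil.which("dot") is None:
    print("Graphviz 'dot' was not found on PATH")
elif not os.path.exists(dot_file):
    print("Run the training script first to create", dot_file)
else:
    subprocess.run(cmd, check=True)  # raises CalledProcessError if dot fails
```

Checking shutil.which("dot") first gives a clearer message than a failed command when the PATH has not been configured.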

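Decision tree induction algorithm (ID3)

1970-1980: J. Ross Quinlan, the ID3 algorithm

Choosing the attribute to test at a node

Information gain: Gain(A) = Info(D) - Info_A(D)

That is, how much information is gained by using attribute A to split at the node.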

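For the table above, Gain(age) = 0.246. Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.

age has the largest gain of the four attributes, so it is chosen as the root node.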
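
The attribute selection can be checked with a short pure-Python sketch. It assumes the usual definitions Info(D) = -sum(p_i * log2(p_i)) over the classes and Info_A(D) = sum(|D_j| / |D| * Info(D_j)) over the values of A; the helper names entropy and info_gain are made up for this example and do not come from the original script.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]
# The 14 rows of the table above, without RID; the class label is the last field.
ROWS = [line.split(",") for line in """\
youth,high,no,fair,no
youth,high,no,excellent,no
middle_aged,high,no,fair,yes
senior,medium,no,fair,yes
senior,low,yes,fair,yes
senior,low,yes,excellent,no
middle_aged,low,yes,excellent,yes
youth,medium,no,fair,no
youth,low,yes,fair,yes
senior,medium,yes,fair,yes
youth,medium,yes,excellent,yes
middle_aged,medium,no,excellent,yes
middle_aged,high,yes,fair,yes
senior,medium,no,excellent,no""".splitlines()]


def entropy(labels):
    """Info(D): Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, col):
    """Gain(A) = Info(D) - Info_A(D) for the attribute in column col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])  # class labels per attribute value
    info_a = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - info_a


gains = {name: info_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in gains.items():
    print(f"Gain({name}) = {gain:.3f}")
print("root:", max(gains, key=gains.get))  # root: age
```

The printed values can differ from the figures quoted above in the third decimal place, because the quoted figures subtract already-rounded intermediate results; the ranking of the attributes is the same.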

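2. Problems encountered and solutions

1. Garbled data when opening the .csv file: if the first column is garbled as in the figure below, the wrong format option was chosen when saving the file as .csv.

The correct choice is the one shown in the figure:

The choice shown in the figure below produces garbled data: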

2. If this error appears: AttributeError: '_csv.reader' object has no attribute 'next'

Change headers = reader.next() to headers = next(reader). This is a difference between Python 3 and Python 2.

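3. Points to note when converting the .dot file with Graphviz for visualization: in the cmd prompt, run dot -Tpdf iris.dot -o output.pdf. (1) iris.dot stands for the path of your own .dot file. (2) The output file is written to the directory shown at the start of the cmd prompt, i.e. the current working directory.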

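4. Error when predicting on the test row: array=[ 1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample. This message appears when a 1-D array is passed to clf.predict(). A single sample has to be passed as a 2-D array, either by wrapping it in a list, clf.predict([newRowX]), as in the source code above, or with newRowX.reshape(1, -1). As shown in the figure: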