Machine Learning: A Decision Tree Case Study

Background

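This is one of my favorite algorithms, and I use it often. A decision tree is a supervised learning algorithm used mostly for classification problems. Perhaps surprisingly, it works for both categorical and continuous dependent variables. The algorithm splits the population into two or more homogeneous sets, choosing each split on the most significant attributes (independent variables) so that the resulting groups are as distinct from one another as possible. For details, see the simplified decision tree tutorial: https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/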

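In the figure above, the population is divided into four groups based on multiple attributes in order to identify "whether they will play or not". To split the population into groups that differ from one another, the algorithm uses criteria such as Gini, information gain, chi-square, and entropy.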
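To make two of these criteria concrete, here is a minimal sketch of Gini impurity, entropy, and information gain for a list of class labels. The function names and the toy labels are illustrative assumptions, not part of the original article or of scikit-learn's API.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy labels: 1 = "will play", 0 = "will not play"
parent = [1, 1, 0, 0]
print(gini(parent))                                # 0.5
print(entropy(parent))                             # 1.0
print(information_gain(parent, [[1, 1], [0, 0]]))  # 1.0
```

A perfectly mixed two-class group has the highest impurity (Gini 0.5, entropy 1 bit), and a split that separates the classes completely recovers all of that entropy as information gain.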

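The best way to understand how a decision tree works is to play Jezzball, a classic game from Microsoft (shown in the figure below). Basically, you have a room with moving walls, and you need to build walls so that as much of the area as possible is cleared of balls.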

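So every time you split the room with a wall, you are trying to create two different populations within the same room. A decision tree works in a very similar way, dividing the population into groups that are as different from each other as possible.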
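The "wall" idea can be sketched for a single numeric feature: try every midpoint between neighboring values as a threshold and keep the one whose two sides are purest. This is a simplified illustration using weighted Gini impurity, not scikit-learn's actual implementation; the helper names and the sample values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the purest split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        # No wall can be placed between two identical feature values
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weight each side's impurity by its share of the samples
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical feature values and class labels
values = [22, 25, 30, 35, 40, 52]
labels = [1, 1, 1, 0, 0, 0]
print(best_split(values, labels))  # (32.5, 0.0)
```

A tree learner repeats this kind of search over all features at each node, then recurses into the two resulting groups.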

Next, let's look at a decision tree example using Python and scikit-learn:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

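# separate the features from the target column 'Survived'
train_x = train_data.drop(columns=['Survived'])
train_y = train_data['Survived']

test_x = test_data.drop(columns=['Survived'])
test_y = test_data['Survived']

# create the classifier with default hyperparameters and fit it on the training data
model = DecisionTreeClassifier()
model.fit(train_x, train_y)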

# depth of the decision tree
print('Depth of the Decision Tree :', model.get_depth())

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data',predict_train) 

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test) 

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

Output of the code above:

Shape of training data : (712, 25)
Shape of testing data : (179, 25)
Depth of the Decision Tree : 19
Target on train data [0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0
 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0
 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0
 0 0 0 1 1 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1
 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0
 0 1 1 0 1 1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0
 0 0 1 1 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1
 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 1 0 1 1 0 1 1 1
 0 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 1 0 0
 0 1 0 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
 0 0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 1 1 0 1 0 1 0 0 1
 0 0 0 1 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1 1 1 0
 1 0 1 0 1 1 0 1 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1
 0 0 0 1 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 1 0 1 0 1 0
 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0
 1 0 1 1 1 0 0 1 0]
accuracy_score on train dataset :  0.9859550561797753
Target on test data [0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 0 1 1 1 1 0
 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 0 1 1 0 1
 0 1 0 0 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 1 0
 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0]
accuracy_score on test dataset :  0.770949720670391
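The tree above reaches depth 19 and scores about 0.986 on the training data but only about 0.771 on the test data, a gap that usually points to overfitting. One common way to explore this is to cap the depth with the `max_depth` parameter of `DecisionTreeClassifier`. The sketch below is only an illustration: because `train-data.csv` and `test-data.csv` are not available here, it uses synthetic data from `make_classification`, and the dataset sizes, the depth of 3, and the `random_state` values are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the article's data (assumption: 891 rows, 24 features)
X, y = make_classification(n_samples=891, n_features=24,
                           n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    results[depth] = (model.get_depth(), train_acc, test_acc)
    print('max_depth =', depth,
          '| actual depth :', model.get_depth(),
          '| train accuracy :', train_acc,
          '| test accuracy :', test_acc)
```

Comparing the train and test accuracy for each setting shows how much of the training score comes from memorizing the training rows. On real data, a suitable depth is normally chosen by cross-validation rather than fixed by hand.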

Originally published on the WeChat official account yale记 (python_yale_learning)

Original publication date: 2019-09-18
