
Using many Decision Trees – random forests

In this recipe, we'll use random forests for classification tasks. Random forests are used because they're very robust to overfitting and perform well in a variety of situations.

Getting ready

We'll explore this more in the How it works... section of this recipe, but random forests work by constructing a lot of very shallow trees and then taking a vote of the class that each tree "voted" for. This idea is very powerful in machine learning. If we recognize that a simple trained classifier might only be 60 percent accurate, we can train lots of classifiers that are generally right and then use the learners together.
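To make the voting idea concrete, here is a minimal sketch that trains several deliberately shallow decision trees on bootstrap samples and takes a majority vote. It only illustrates the principle, not the internals of RandomForestClassifier; the tree count, depth, seed, and the X_demo/y_demo names are arbitrary choices for this sketch.

# A minimal sketch of the voting idea (an illustration, not the internals of
# RandomForestClassifier): shallow trees on bootstrap samples, majority vote.
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = datasets.make_classification(1000)
rng = np.random.RandomState(0)
trees = []
for _ in range(25):
    idx = rng.randint(0, len(X_demo), len(X_demo))     # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X_demo[idx], y_demo[idx]))

votes = np.array([t.predict(X_demo) for t in trees])
majority = (votes.mean(axis=0) > 0.5).astype(int)      # majority vote across the trees
print("Single shallow tree accuracy:", (trees[0].predict(X_demo) == y_demo).mean())
print("Majority-vote accuracy:      ", (majority == y_demo).mean())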

How to do it…

The mechanics of training a random forest classifier are very simple with scikit-learn. In this section, we'll do the following:

1. Create a sample dataset to practice with.

2. Train a basic random forest object.

3. Take a look at some of the attributes of a trained object.

In the next recipe, we'll look at how to tune the random forest classifier. Let's start by importing datasets:

from sklearn import datasets

Then, create a dataset with 1,000 samples:

X, y = datasets.make_classification(1000)

Now that we have the data, we can create a classifier object and train it:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

The first thing we want to do is see how well we fit the training data. We can use the predict method for these projections:

print "Accuracy:\t", (y == rf.predict(X)).mean()
Accuracy: 0.998
print "Total Correct:\t", (y == rf.predict(X)).sum()
Total Correct: 998
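The accuracy above is measured on the training data, which is optimistic. As a quick sanity check that is not part of the original recipe, a held-out split can be used; the sketch below reuses the X and y arrays created above, and train_test_split and accuracy_score are standard scikit-learn utilities.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Hold out 25% of the data to estimate generalization rather than training fit.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
rf_holdout = RandomForestClassifier().fit(X_train, y_train)
print("Train accuracy:", accuracy_score(y_train, rf_holdout.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, rf_holdout.predict(X_test)))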

Now, let's look at some attributes and methods.

First, we'll look at some of the useful attributes; in this case, since we used the defaults, they'll be the object defaults (a short inspection sketch follows the list):

1. rf.criterion: This is the criterion for how the splits are determined. The default is gini.

2. rf.bootstrap: A Boolean that indicates whether we used bootstrap samples when training the random forest.

3. rf.n_jobs: The number of jobs used to train and predict. If you want to use all the processors, set this to -1. Keep in mind that if your dataset isn't very big, using multiple jobs often adds overhead because the data has to be serialized and moved between processes.

4. rf.max_features: This denotes the number of features to consider when making the best split. This will come in handy during the tuning process.

5. rf.compute_importances: This helps us decide whether to compute the importance of the features (this parameter has been removed in later scikit-learn versions). See the There's more... section of this recipe for information on how to use this.

6. rf.max_depth: This denotes how deep each tree can go.
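The settings above can be read directly off the fitted object. This is a small convenience sketch; the exact defaults depend on your scikit-learn version, and compute_importances is omitted because later versions removed it.

# Inspect the hyperparameters stored on the fitted classifier.
print("criterion:   ", rf.criterion)
print("bootstrap:   ", rf.bootstrap)
print("n_jobs:      ", rf.n_jobs)
print("max_features:", rf.max_features)
print("max_depth:   ", rf.max_depth)
print("n_estimators:", rf.n_estimators)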

There are more attributes to note; check out the official documentation for more details. The predict method isn't the only useful one. We can also get the probabilities of each class for individual samples. This can be a useful feature for understanding the uncertainty in each prediction. For instance, we can predict the probabilities of each sample for the various classes:

probs = rf.predict_proba(X)
import pandas as pd
probs_df = pd.DataFrame(probs, columns=['0', '1'])
probs_df['was_correct'] = rf.predict(X) == y
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
probs_df.groupby('0').was_correct.mean().plot(kind='bar', ax=ax)
ax.set_title("Accuracy at 0 class probability")
ax.set_ylabel("% Correct")
ax.set_xlabel("% trees for 0")

The following is the output, a bar chart of accuracy against the predicted probability of class 0:

How it works…

Random forest works by using a predetermined number of weak Decision Trees and by training each one of these trees on a subset of the data. This is critical in avoiding overfitting, and it is also the reason for the bootstrap parameter. Each tree is trained on such a sample, and the forest's prediction is then one of the following (see the sketch after this list):

1. The class with the most votes

2. The averaged output, if we use regression trees

There are, of course, performance considerations, which we'll cover in the next recipe, but for the purposes of understanding how random forests work, we train a bunch of average trees and get a fairly good classifier as a result.
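A rough way to peek under the hood is to look at the fitted trees themselves. The sketch below is only an illustration using the rf object trained above: it averages the per-class probabilities of the individual trees exposed in rf.estimators_ and checks that the result matches the forest's own predictions.

import numpy as np

# Each fitted sub-tree is available via rf.estimators_; averaging their
# per-class probabilities reproduces the forest's (soft) vote.
per_tree_probs = np.array([tree.predict_proba(X) for tree in rf.estimators_])
manual_pred = per_tree_probs.mean(axis=0).argmax(axis=1)
print("Agreement with rf.predict:", (manual_pred == rf.predict(X)).mean())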

There's more…

Feature importance is a good by-product of random forests. This often helps to answer the question: if we have 10 features, which features are most important in determining the true class of the data point? The real-world applications are hopefully easy to see. For example, if a transaction is fraudulent, we probably want to know if there are certain signals that can be used to figure out a transaction's class more quickly.

If we want to calculate the feature importance, the original recipe states it when creating the object. If you use scikit-learn 0.15, you might get a warning that this is not required; in version 0.16 the warning was removed, and in later versions the compute_importances parameter no longer exists, since feature importances are always available from the fitted model:

rf = RandomForestClassifier()
rf.fit(X, y)
f, ax = plt.subplots(figsize=(7, 5))
ax.bar(range(len(rf.feature_importances_)),rf.feature_importances_)
ax.set_title("Feature Importances")

The following is the output, a bar chart of the feature importances:

As we can see, certain features are much more important than others when determining whether the outcome was of class 0 or class 1.
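To put indices on those features, the importances can be sorted; this is a small convenience sketch rather than part of the original recipe.

import numpy as np

# Rank features by importance and show the top five.
order = np.argsort(rf.feature_importances_)[::-1]
for i in order[:5]:
    print("feature %d: importance %.3f" % (i, rf.feature_importances_[i]))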

Original link: http://www.packtpub.com

Original author: Trent Hauck

