
Tuning a random forest model

到不了的都叫做远方
Modified 2020-04-29 15:55:00

In the previous recipe, we reviewed how to use the random forest classifier. In this recipe, we'll walk through how to tune its performance by tuning its parameters.

Getting ready

In order to tune a random forest model, we'll need to first create a dataset that's a little more difficult to predict. Then, we'll alter the parameters and do some preprocessing to fit the dataset better.

So, let's create the dataset first:

from sklearn import datasets
X, y = datasets.make_classification(n_samples=10000, n_features=20,
                                    n_informative=15, flip_y=.5,
                                    weights=[.2, .8])
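Before moving on, it can help to peek at the class balance (a quick optional check of my own, not part of the original recipe); flip_y=.5 reassigns roughly half of the labels at random, which is largely what makes this dataset hard to predict:

import numpy as np

# Counts and proportions per class: weights=[.2, .8] skews the draw,
# and flip_y=.5 then assigns about half the labels randomly.
print(np.bincount(y))
print(np.bincount(y) / float(y.size))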

How to do it…

In this recipe, we will do the following:

1. Create a training and test set. We won't just sail through this recipe like we did in the previous one; it's an empty exercise to tune a model without a held-out set to compare it against.

2. Fit a baseline random forest to evaluate how well we do with a naive algorithm.

3. Alter some parameters in a systematic way, and then observe what happens to the fit.

Ok, start an interpreter and import NumPy:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Randomly assign roughly 80% of the samples to the training set
training = np.random.choice([True, False], p=[.8, .2], size=y.shape)

rf = RandomForestClassifier()
rf.fit(X[training], y[training])
preds = rf.predict(X[~training])
print("Accuracy:\t", (preds == y[~training]).mean())
# Accuracy: 0.652239557121
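The boolean mask above keeps things NumPy-only; an equivalent, more common approach (a sketch, not the book's method) is scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print("Accuracy:\t", (rf.predict(X_test) == y_test).mean())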

I'm going to cheat a little bit and introduce one of the model evaluation metrics we will talk about later in the book. Accuracy is a good first metric, but using a confusion matrix will help us understand what's going on. Let's iterate through the recommended choices for max_features and see what it does to the fit. We'll also iterate through a couple of floats, which specify the fraction of the features that will be used. Use the following commands to do so:

from sklearn.metrics import confusion_matrix

max_feature_params = ['auto', 'sqrt', 'log2', .01, .5, .99]
confusion_matrixes = {}
for max_feature in max_feature_params:
    rf = RandomForestClassifier(max_features=max_feature)
    rf.fit(X[training], y[training])
    confusion_matrixes[max_feature] = confusion_matrix(y[~training],
                                                       rf.predict(X[~training])).ravel()
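For a binary problem, each raveled confusion matrix has four entries, flattened row by row in scikit-learn's convention; as a small aside (my illustration, not from the original text), they unpack like this:

# ravel() flattens the 2x2 matrix to
# (true negatives, false positives, false negatives, true positives)
tn, fp, fn, tp = confusion_matrixes['auto']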

Now, import pandas and look at the confusion matrix we just created:

Since I used the ravel method, our 2D confusion matrices are now 1D.

import pandas as pd
confusion_df = pd.DataFrame(confusion_matrixes)
import itertools
from matplotlib import pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
confusion_df.plot(kind='bar', ax=ax)
ax.legend(loc='best')
ax.set_title("Guessed vs Correct (i, j) where i is the guess and j is the actual.")
ax.grid()
ax.set_xticklabels([str((i, j)) for i, j in list(itertools.product(range(2), range(2)))]);
ax.set_xlabel("Guessed vs Correct")
ax.set_ylabel("Correct")

The following is the output:

While we didn't see any real difference in performance, this is a fairly simple process to run through for your own projects. Let's try it on the choice of n_estimators, but use raw accuracy this time; with more than a few options, a confusion-matrix graph would become very cloudy and difficult to use.

Since we're using the confusion matrix, we can get the accuracy from the trace of the confusion matrix divided by its overall sum:

n_estimator_params = range(1, 20)
confusion_matrixes = {}
for n_estimator in n_estimator_params:
    rf = RandomForestClassifier(n_estimators=n_estimator)
    rf.fit(X[training], y[training])
    confusion_matrixes[n_estimator] = confusion_matrix(y[~training],
                                                       rf.predict(X[~training]))

    # here's where we update the confusion matrix with the operation we
    # talked about: accuracy is the trace divided by the overall sum
    accuracy = lambda x: np.trace(x) / np.sum(x, dtype=float)
    confusion_matrixes[n_estimator] = accuracy(confusion_matrixes[n_estimator])

accuracy_series = pd.Series(confusion_matrixes)
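The trace-over-sum computation above is just the definition of accuracy; if you'd rather not roll it by hand, sklearn.metrics.accuracy_score gives the same number (shown here as an equivalent shortcut, not how the book computes it):

from sklearn.metrics import accuracy_score

# Same quantity as np.trace(cm) / np.sum(cm); reuses X, y, and the
# training mask defined earlier. n_estimators=10 is an arbitrary example.
rf = RandomForestClassifier(n_estimators=10)
rf.fit(X[training], y[training])
print(accuracy_score(y[~training], rf.predict(X[~training])))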

import itertools
from matplotlib import pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
accuracy_series.plot(kind='bar', ax=ax, color='k', alpha=.75)
ax.grid()
ax.set_title("Accuracy by Number of Estimators")
ax.set_ylim(0, 1) # we want the full scope
ax.set_ylabel("Accuracy")
ax.set_xlabel("Number of Estimators")

The following is the output:

Notice how accuracy is going up for the most part. There certainly is some randomness associated with the accuracy, but the trend is up and to the right. In the following How it works... section, we'll talk about the association between random forests and bootstrapping, and what is generally better.

How it works…

Bootstrapping is a nice technique that carries over to other parts of modeling. The case often used to introduce bootstrapping is attaching standard errors to a median. Here, we estimate the outcome over and over and aggregate the estimates up into probabilities.
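To make that concrete, here is a minimal sketch (my own illustration, assuming nothing beyond NumPy) of bootstrapping a standard error for a median:

import numpy as np

data = np.random.normal(size=1000)

# Resample with replacement many times, take the median of each resample;
# the spread of those medians estimates the standard error of the median.
medians = [np.median(np.random.choice(data, size=data.size, replace=True))
           for _ in range(1000)]
print("median:", np.median(data), "bootstrap SE:", np.std(medians))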

So, by simply increasing the number of estimators, we increase the number of subsamples, which leads to an overall faster convergence.

There's more…

We might want to speed up the training process. I alluded to this earlier: we can set n_jobs to the number of trees we want to train at the same time. This should roughly be the number of cores on the machine:

rf = RandomForestClassifier(n_jobs=4, verbose=True)
rf.fit(X, y)
# [Parallel(n_jobs=4)]: Done 1 out of 4 | elapsed: 0.3s remaining: 0.9s
# [Parallel(n_jobs=4)]: Done 4 out of 4 | elapsed: 0.3s finished
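If you don't want to count cores yourself, passing n_jobs=-1 tells scikit-learn to use every available core:

rf = RandomForestClassifier(n_jobs=-1)  # -1 means use all available cores
rf.fit(X, y)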

This will also predict in parallel (verbosely):

rf.predict(X)
# [Parallel(n_jobs=4)]: Done 1 out of 4 | elapsed: 0.0s remaining: 0.0s
# [Parallel(n_jobs=4)]: Done 4 out of 4 | elapsed: 0.0s finished
# array([1, 1, 0, ..., 1, 1, 1])
