# Using many Decision Trees – random forests

In this recipe, we'll use random forests for classification tasks. Random forests are used because they're very robust to overfitting and perform well in a variety of situations.

We'll explore this more in the How it works... section of this recipe, but random forests work by constructing many weak decision trees and then taking a vote of the class that each tree "voted" for. This idea is very powerful in machine learning: even if a single trained classifier is only 60 percent accurate, we can train many such classifiers that are each usually right and then combine their votes into a much stronger learner.
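The following back-of-the-envelope sketch (not part of the original recipe) illustrates why: assuming each tree is independently correct 60 percent of the time, the chance that a majority of the trees votes correctly grows quickly with the number of trees. Real trees in a forest are correlated, so the gain is smaller in practice, but the intuition holds.

```
from math import comb

def majority_accuracy(p, n_trees):
    """Probability that a strict majority of n_trees independent
    classifiers, each correct with probability p, is correct."""
    k = n_trees // 2 + 1  # smallest strict majority
    return sum(comb(n_trees, i) * p ** i * (1 - p) ** (n_trees - i)
               for i in range(k, n_trees + 1))

for n in (1, 11, 101):
    print(n, round(majority_accuracy(0.6, n), 3))
# roughly 0.6, 0.75, and 0.98 as the ensemble grows
```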

How to do it…

The mechanics of training a random forest classifier are very simple with scikit-learn. In this section, we'll do the following:

1. Create a sample dataset to practice with.

2. Train a basic random forest object.

3. Take a look at some of the attributes of a trained object.

In the next recipe, we'll look at how to tune the random forest classifier. Let's start by importing datasets:

`from sklearn import datasets`

Then, create the dataset with 1,000 samples:

`X, y = datasets.make_classification(1000)`
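Called with a single positional argument like this, make_classification uses its defaults for everything else (20 features, 2 classes); a quick sanity check, if you want to confirm what was generated:

```
print(X.shape)  # (1000, 20) with the default n_features
print(y.shape)  # (1000,) of 0/1 labels
```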

Now that we have the data, we can create a classifier object and train it:

```
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X, y)
# Fitting echoes the estimator with its defaults (output from an older
# scikit-learn release; current versions default to n_estimators=100 and
# print only non-default settings):
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
```

The first thing we want to do is see how well we fit the training data. We can use the predict method for this check:

```print "Accuracy:\t", (y == rf.predict(X)).mean()
Accuracy: 0.998
print "Total Correct:\t", (y == rf.predict(X)).sum()
Total Correct: 998```
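Keep in mind that this accuracy is measured on the very data the forest was trained on, so it mostly reflects memorization. A minimal sketch of the same check on a held-out split (train_test_split is not part of the original recipe; in the old scikit-learn releases this book targets, it lives in sklearn.cross_validation rather than sklearn.model_selection):

```
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hold out 30 percent of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
rf_holdout = RandomForestClassifier(random_state=0)
rf_holdout.fit(X_train, y_train)
print("Held-out accuracy:", (y_test == rf_holdout.predict(X_test)).mean())
```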

Now, let's look at some attributes and methods.

First, we'll look at some of the useful attributes; in this case, since we used defaults, they'll be the object defaults (the short snippet after this list prints them from the fitted object):

1. rf.criterion: This is the criterion that determines how splits are chosen. The default is gini.

2. rf.bootstrap: A Boolean that indicates whether we used bootstrap samples when training the random forest.

3. rf.n_jobs: The number of jobs used for training and prediction. If you want to use all the processors, set this to -1. Keep in mind that if your dataset isn't very big, using multiple jobs often adds overhead because the data has to be serialized and moved between processes.

4. rf.max_features: This denotes the number of features to consider when searching for the best split. This will come in handy during the tuning process.

5. rf.compute_importances: This decides whether to compute the importance of the features. Note that this parameter exists only in old scikit-learn releases; in current versions, feature_importances_ is always available after fitting. See the There's more... section of this recipe for information on how to use it.

6. rf.max_depth: This denotes how deep each tree can go.

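For example, you can print these attributes straight from the fitted object; the values noted in the comments are the defaults of older releases and may differ in your scikit-learn version:

```
print(rf.criterion)     # 'gini'
print(rf.bootstrap)     # True
print(rf.n_jobs)        # None: a single process unless you set it
print(rf.max_features)  # 'auto' in older releases, 'sqrt' in current ones
print(rf.max_depth)     # None: trees are grown until their leaves are pure
```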

There are more attributes to note; check out the official documentation for more details.

The predict method isn't the only useful one. We can also get the probabilities of each class for individual samples. This can be a useful feature for understanding the uncertainty in each prediction. For instance, we can predict the probability that each sample belongs to the various classes:

```
probs = rf.predict_proba(X)

import pandas as pd
probs_df = pd.DataFrame(probs, columns=['0', '1'])
probs_df['was_correct'] = rf.predict(X) == y

import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
probs_df.groupby('0').was_correct.mean().plot(kind='bar', ax=ax)
ax.set_title("Accuracy at 0 class probability")
ax.set_ylabel("% Correct")
ax.set_xlabel("% trees for 0")
```

The output is a bar chart of accuracy at each predicted class-0 probability (the figure is not reproduced here).

How it works…

Random forest works by training a predetermined number of weak decision trees, each on a bootstrap sample (a random subset) of the data. This is critical in avoiding overfitting, and it is also the reason for the bootstrap parameter. The forest's overall prediction is then:

1. The class with the most votes, for classification

2. The average of the trees' outputs, if we use regression trees

There are, of course, performance considerations, which we'll cover in the next recipe, but for the purposes of understanding how random forests work, we train a bunch of average trees and get a fairly good classifier as a result (the sketch below makes the voting explicit).

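The fitted forest exposes its individual trees through the estimators_ attribute. The following sketch (not part of the original recipe) reproduces the forest's prediction for a single sample by majority vote; note that each sub-tree predicts an index into rf.classes_ (identical to the 0/1 labels here), and that current scikit-learn averages the trees' predicted probabilities rather than counting hard votes, so the two can occasionally differ:

```
import numpy as np

sample = X[:1]  # one sample, kept two-dimensional for predict()
votes = np.array([tree.predict(sample)[0] for tree in rf.estimators_])
print("Individual tree votes:", votes)
print("Majority vote:        ", int(votes.mean() > 0.5))
print("Forest prediction:    ", rf.predict(sample)[0])
```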

There's more…

Feature importance is a good by-product of random forests. It often helps to answer the question: if we have 10 features, which ones matter most in determining the true class of a data point? The real-world applications are hopefully easy to see. For example, if a transaction is fraudulent, we probably want to know whether there are certain signals that can be used to figure out a transaction's class more quickly.

In very old scikit-learn releases, computing feature importance had to be requested when the object was created (the compute_importances parameter); scikit-learn 0.15 warns that this is no longer required, and from 0.16 onward the parameter is gone and feature_importances_ is always available after fitting:

```
rf = RandomForestClassifier()
rf.fit(X, y)

f, ax = plt.subplots(figsize=(7, 5))
ax.bar(range(len(rf.feature_importances_)), rf.feature_importances_)
ax.set_title("Feature Importances")
```

The output is a bar chart of the importance of each feature (the figure is not reproduced here).

As we can see, certain features are much more important than others when determining if the outcome was of class 0 or class 1.
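If you prefer a ranked list to the bar chart, here is a short follow-up sketch (again assuming the rf object fitted above):

```
import numpy as np

# Sort feature indices by importance, highest first, and show the top five.
order = np.argsort(rf.feature_importances_)[::-1]
for idx in order[:5]:
    print("feature %d: importance %.3f" % (idx, rf.feature_importances_[idx]))
```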
