Poor man's grid search

到不了的都叫做远方

Modified on 2020-05-06 11:46:57


In this recipe, we introduce grid search using basic Python, though we will use sklearn for the models and matplotlib for the visualization.

Getting ready

In this recipe, we will perform the following tasks:

1. Design a basic search grid in the parameter space

2. Iterate through the grid and check the loss/score function at each point in the parameter space for the dataset

3. Choose the point in the parameter space that minimizes/maximizes the evaluation function

The model we'll fit is a basic decision tree classifier. Our parameter space will be two-dimensional to help with the visualization:

criteria = {gini, entropy}

max_features = {auto, log2, None}

The parameter space will then be the Cartesian product of those two sets:

criteria × max_features

We'll see in a bit how we can iterate through this space with itertools.

Let's create the dataset and then get started:

from sklearn import datasets
X, y = datasets.make_classification(n_samples=2000, n_features=10)

How to do it...

Earlier we said that we'd use grid search to tune two parameters: criterion and max_features. We need to represent their candidate values as Python sets, and then use itertools.product to iterate through them:

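import itertools as it

criteria = {'gini', 'entropy'}
# Note: 'auto' was deprecated in scikit-learn 1.1 and later removed;
# for classifiers, 'sqrt' is the equivalent value.
max_features = {'auto', 'log2', None}
# it.product returns a one-shot iterator over all (criterion, max_feature) pairs.
parameter_space = it.product(criteria, max_features)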
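
One detail of itertools.product is worth seeing in isolation: it returns an iterator that can be consumed only once. The following is a minimal sketch of that behavior; it uses lists instead of the sets above purely so that the order is reproducible, which is an assumption of this example and not part of the recipe.

```python
import itertools as it

# Lists (not sets) so the iteration order is the same on every run.
criteria = ["gini", "entropy"]
max_features = ["auto", "log2", None]

parameter_space = it.product(criteria, max_features)

# The first pass yields every (criterion, max_feature) pair: 2 * 3 = 6 tuples.
first_pass = list(parameter_space)
# The iterator is now exhausted, so a second pass yields nothing.
second_pass = list(parameter_space)

print(len(first_pass))   # 6
print(first_pass[0])     # ('gini', 'auto')
print(second_pass)       # []
```

If the grid needs to be walked more than once, wrapping the call as list(it.product(...)) is one way to keep the pairs around.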

Great! Now that we have the parameter space, let's iterate through it and check the accuracy of each model as specified by the parameters. Then, we'll store that accuracy so that we can compare the different points in the parameter space. We'll also use a roughly 50/50 train/test split:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random boolean mask: True marks training rows, False marks test rows.
train_set = np.random.choice([True, False], size=len(y))

accuracies = {}
for criterion, max_feature in parameter_space:
    dt = DecisionTreeClassifier(criterion=criterion,
                                max_features=max_feature)
    dt.fit(X[train_set], y[train_set])
    # Accuracy on the held-out rows, keyed by the parameter combination.
    accuracies[(criterion, max_feature)] = (dt.predict(X[~train_set]) == y[~train_set]).mean()
accuracies
 {('entropy', 'log2'): 0.9412360688956434,
 ('entropy', None): 0.9685916919959473,
 ('entropy', 'auto'): 0.9493414387031408,
 ('gini', 'log2'): 0.9645390070921985,
 ('gini', None): 0.9645390070921985,
 ('gini', 'auto'): 0.950354609929078}
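
One caveat about the split used above: np.random.choice draws each mask entry independently, so the train and test sets are only approximately equal in size, and the result changes from run to run. The following is a minimal sketch of an exact, reproducible 50/50 split, assuming scikit-learn's train_test_split. The random_state values are arbitrary choices for this example and are not from the original recipe.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same kind of dataset as in the recipe; random_state is an added assumption.
X, y = datasets.make_classification(n_samples=2000, n_features=10,
                                    random_state=0)

# test_size=0.5 puts exactly half of the 2000 rows in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

dt = DecisionTreeClassifier(criterion="gini", max_features=None,
                            random_state=0)
dt.fit(X_train, y_train)

# score() returns the mean accuracy on the held-out half.
accuracy = dt.score(X_test, y_test)
print(len(y_train), len(y_test), accuracy)
```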

We now have the accuracy of each parameter combination. Let's visualize the performance:

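from matplotlib import pyplot as plt
from matplotlib import cm

cmap = cm.RdBu_r
f, ax = plt.subplots(figsize=(7, 4))

# Build one row per max_features value and one column per criterion.
plot_array = []
for max_feature in max_features:
    m = []
    for criterion in criteria:
        m.append(accuracies[(criterion, max_feature)])
    plot_array.append(m)  # append each completed row exactly once
print(plot_array)

colors = ax.matshow(plot_array, cmap=cmap)
# Label the ticks after plotting so that they line up with the cells.
ax.set_xticks(range(len(criteria)))
ax.set_xticklabels(list(criteria))
ax.set_yticks(range(len(max_features)))
ax.set_yticklabels([str(mf) for mf in max_features])
f.colorbar(colors)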

The following is the output:

It's fairly easy to see which combination performed best here. Hopefully, you can see how this process can be taken further with a brute force method.
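
As a rough illustration of where the brute force approach can lead, here is a sketch that assumes scikit-learn's GridSearchCV is the kind of tool meant. The original text does not name it, so treat this as one possible next step and not as the recipe's own method. It replaces the single train/test split with 5-fold cross-validation, uses 'sqrt' in place of 'auto' so that it runs on recent scikit-learn releases, and fixes random_state for reproducibility; all three are choices made for this example.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.make_classification(n_samples=2000, n_features=10,
                                    random_state=0)

# The same 2 x 3 grid as in the recipe, written as a dict of lists.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2", None],
}

# Fits one model per grid point and per fold, then keeps the best grid point.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```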

How it works...

This works fairly simply; we just have to perform the following steps:

1. Choose a set of parameters.

2. Iterate through them and find the accuracy of each step.

3. Find the best performer by visual inspection.
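
Visual inspection works for a 2 x 3 grid, but the last step can also be done programmatically. The following is a small sketch that reuses the accuracies dictionary printed earlier in this recipe; the names best_params, best_accuracy, and ranked are introduced here for illustration only.

```python
# Accuracy values copied from the output shown earlier in the recipe.
accuracies = {
    ('entropy', 'log2'): 0.9412360688956434,
    ('entropy', None): 0.9685916919959473,
    ('entropy', 'auto'): 0.9493414387031408,
    ('gini', 'log2'): 0.9645390070921985,
    ('gini', None): 0.9645390070921985,
    ('gini', 'auto'): 0.950354609929078,
}

# The key with the highest accuracy is the best point in the grid.
best_params = max(accuracies, key=accuracies.get)
best_accuracy = accuracies[best_params]
print(best_params, best_accuracy)  # ('entropy', None) 0.9685916919959473

# Sorting all grid points from best to worst gives a full ranking.
ranked = sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True)
for params, acc in ranked:
    print(params, round(acc, 4))
```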

This article is a translation of a foreign-language source.
