
K-fold cross validation

到不了的都叫做远方
Modified 2020-05-06 11:45:35
Published in the column: Translating scikit-learn Cookbook

In this recipe, we'll set up what is quite possibly the most important post-model validation exercise: cross validation. We'll talk about k-fold cross validation specifically. There are several varieties of cross validation, each with a slightly different randomization scheme; k-fold is perhaps the best known of them.

Getting ready

We'll create some data and then walk through the folds on which a model would be fit. It's worth mentioning that if you can afford to keep a holdout set, that is the best setup. For example, suppose we have a dataset where N = 1000. We can hold out 200 data points and then use cross validation on the other 800 points to determine the best parameters.
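The holdout idea can be sketched end to end as follows. This is a minimal illustration, not part of the original recipe: the Ridge model, the alpha grid, and the `n_features`, `noise`, and `random_state` values are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

N, holdout = 1000, 200
# Synthetic data; n_features, noise and random_state are illustrative choices
X, y = make_regression(n_samples=N, n_features=20, noise=10.0, random_state=0)

# Reserve the first 200 points; they are never seen during parameter selection
X_h, y_h = X[:holdout], y[:holdout]
X_t, y_t = X[holdout:], y[holdout:]

# Pick the best alpha with 4-fold cross validation on the remaining 800 points
alpha_grid = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(Ridge(), param_grid={"alpha": alpha_grid}, cv=KFold(n_splits=4))
search.fit(X_t, y_t)

best_alpha = search.best_params_["alpha"]
# Score the refitted best model once on the untouched holdout set (R^2)
holdout_r2 = search.score(X_h, y_h)
print(best_alpha, round(holdout_r2, 3))
```

Because the holdout points play no part in choosing alpha, the final score is a less optimistic estimate than the cross validation scores themselves.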

How to do it...

First, we'll create some fake data, then we'll examine the parameters, and finally we'll look at the size of the resulting dataset:

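```python
from sklearn.datasets import make_regression

N = 1000       # total number of samples
holdout = 200  # number of samples reserved as the holdout set

# Generate a synthetic regression dataset with N samples
X, y = make_regression(n_samples=N, shuffle=True)
```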

Now that we have the data, let's hold out 200 points and then go through the fold scheme as we normally would:

```python
from sklearn.model_selection import KFold

# The first 200 rows form the holdout set; the remaining 800 are used for cross validation
X_h, y_h = X[:holdout], y[:holdout]
X_t, y_t = X[holdout:], y[holdout:]
```
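To look at the size of the resulting datasets, a quick shape check is enough. The sketch below repeats the setup so that it runs on its own; `random_state=0` is an assumption added for repeatability, and the 100 columns come from the default `n_features` of `make_regression`.

```python
from sklearn.datasets import make_regression

N = 1000
holdout = 200
X, y = make_regression(n_samples=N, shuffle=True, random_state=0)

X_h, y_h = X[:holdout], y[:holdout]
X_t, y_t = X[holdout:], y[holdout:]

print(X.shape, y.shape)      # full dataset
print(X_h.shape, y_h.shape)  # holdout set
print(X_t.shape, y_t.shape)  # data left for cross validation
```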

KFold lets us choose how many folds we want (n_splits), whether to shuffle the dataset before splitting (shuffle), and the random state (random_state, mainly for reproducibility when shuffling). Older releases also had an indices option for choosing between index arrays and Boolean masks; it was deprecated and later removed, so the splitter now always yields integer index arrays.
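The effect of these options can be seen on a tiny example. This is a sketch on 12 made-up sample indices; the helper `collect_test_folds` is a hypothetical name introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(12)  # 12 toy samples, identified by their index

def collect_test_folds(splitter):
    """Return the test indices of every fold as plain lists."""
    return [test.tolist() for _, test in splitter.split(indices)]

# Without shuffling, the folds are consecutive blocks
plain = collect_test_folds(KFold(n_splits=4))

# With shuffling, a fixed random_state makes the split repeatable
seeded_a = collect_test_folds(KFold(n_splits=4, shuffle=True, random_state=0))
seeded_b = collect_test_folds(KFold(n_splits=4, shuffle=True, random_state=0))

print(plain)
print(seeded_a == seeded_b)
```

Each sample still lands in exactly one test fold when shuffling is enabled; only the assignment of samples to folds changes.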

Let's create the cross validation object:

```python
# 4 folds; the number of samples is taken from the data passed to split()
kfold = KFold(n_splits=4)
```

Now, we can iterate through the k-fold object:

```python
output_string = "Fold: {}, N_train: {}, N_test: {}"
# split() yields one (train indices, test indices) pair per fold
for i, (train, test) in enumerate(kfold.split(X_t)):
    print(output_string.format(i, len(y_t[train]), len(y_t[test])))
```

```
Fold: 0, N_train: 600, N_test: 200
Fold: 1, N_train: 600, N_test: 200
Fold: 2, N_train: 600, N_test: 200
Fold: 3, N_train: 600, N_test: 200
```

Each iteration should return the same split size: here, 600 training points and 200 test points per fold, because 800 divides evenly by 4.
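The index arrays from each fold are what a model is actually fit and scored on. The following is a minimal sketch of that step, assuming a plain LinearRegression and freshly generated data (`n_features`, `noise`, and `random_state` are illustrative choices, not values from the recipe).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Stand-in for the 800 non-holdout points
X_t, y_t = make_regression(n_samples=800, n_features=20, noise=10.0, random_state=0)

kfold = KFold(n_splits=4)
fold_scores = []
for train, test in kfold.split(X_t):
    # Fit on the 600 training rows of this fold, score (R^2) on the 200 test rows
    model = LinearRegression().fit(X_t[train], y_t[train])
    fold_scores.append(model.score(X_t[test], y_t[test]))

print(np.round(fold_scores, 3))
print(round(float(np.mean(fold_scores)), 3))
```

The mean of the per-fold scores is the usual single-number summary of the cross validation run.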

How it works...

It's probably clear, but k-fold works by iterating through the folds and holding out 1/n_splits * N points each time, where N for us was len(y_t). From a Python perspective, the split() method of a cross validation object returns an iterator that can be consumed with a for ... in loop. It's often useful to write a wrapper around a cross validation object so that it iterates over subsets of the data rather than individual rows. For example, we may have a dataset with repeated measures for each data point, or a dataset of patients in which each patient has several measurements.
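The hold-out arithmetic can be written out by hand. The generator below is a simplified sketch of unshuffled k-fold splitting, not scikit-learn's actual source; `manual_kfold` is a hypothetical name. When N is not divisible by the number of folds, it gives the first N % n_splits folds one extra test point.

```python
import numpy as np

def manual_kfold(n_samples, n_splits):
    """Yield (train, test) index arrays for consecutive, unshuffled folds."""
    indices = np.arange(n_samples)
    # Every fold holds out n_samples // n_splits points ...
    fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
    # ... and the remainder is spread over the first folds
    fold_sizes[: n_samples % n_splits] += 1
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = np.concatenate([indices[:start], indices[start + size:]])
        yield train, test
        start += size

sizes_800 = [(len(tr), len(te)) for tr, te in manual_kfold(800, 4)]
sizes_10 = [(len(tr), len(te)) for tr, te in manual_kfold(10, 4)]
print(sizes_800)
print(sizes_10)
```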

We're going to mix it up and use pandas for this part:

```python
import numpy as np
import pandas as pd

# 100 patients with 8 measurements each (800 rows in total)
patients = np.repeat(np.arange(0, 100, dtype=np.int8), 8)
measurements = pd.DataFrame({'patient_id': patients,
                             'ys': np.random.normal(0, 1, 800)})
```

Now that we have the data, we want to hold out whole patients (the code below calls them customers) rather than individual data points:

```python
custids = np.unique(measurements.patient_id)
customer_kfold = KFold(n_splits=4)
output_string = "N_train: {}, N_test: {}"
# Split the unique patient IDs, then select every row belonging to each group of IDs
for (train, test) in customer_kfold.split(custids):
    train_cust_ids = custids[train]
    training = measurements[measurements.patient_id.isin(train_cust_ids)]
    testing = measurements[~measurements.patient_id.isin(train_cust_ids)]
    print(output_string.format(len(training), len(testing)))
```

```
N_train: 600, N_test: 200
N_train: 600, N_test: 200
N_train: 600, N_test: 200
N_train: 600, N_test: 200
```
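scikit-learn also provides GroupKFold in sklearn.model_selection, which keeps all rows that share a group label in the same fold. The sketch below applies it to the same patient layout (100 patients, 8 rows each) as an alternative to splitting the unique IDs by hand; it is not the recipe's own method, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Same layout as above: 100 patients, 8 measurements each
patients = np.repeat(np.arange(100), 8)
measurements = pd.DataFrame({"patient_id": patients,
                             "ys": np.random.normal(0, 1, 800)})

group_kfold = GroupKFold(n_splits=4)
fold_sizes = []
overlaps = []
for train, test in group_kfold.split(measurements, groups=measurements["patient_id"]):
    train_ids = set(measurements["patient_id"].iloc[train])
    test_ids = set(measurements["patient_id"].iloc[test])
    # A patient must never appear on both sides of a split
    overlaps.append(len(train_ids & test_ids))
    fold_sizes.append((len(train), len(test)))

print(fold_sizes)
print(overlaps)
```

Here train and test are row positions in the DataFrame, so they can be passed directly to `.iloc` to select the matching measurements.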

This article is a translation of a foreign-language work.

In case of infringement, contact cloudcommunity@tencent.com to request removal.
