Stratified k-fold

到不了的都叫做远方
Modified 2020-05-06 11:46:47
This article is included in the column: 翻译scikit-learn Cookbook (a translation of the scikit-learn Cookbook)

In this recipe, we'll quickly look at stratified k-fold validation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is useful because its scheme is specifically designed to maintain the class proportions in each fold.

Getting ready

We're going to create a small dataset and then use stratified k-fold validation on it. We want it small so that we can see the variation; for larger samples, it probably won't matter as much. We'll then plot the class proportions at each step to illustrate how they are maintained:

from sklearn import datasets
X, y = datasets.make_classification(n_samples=int(1e3),weights=[1./11])

Let's check the overall class weight distribution:

y.mean()
0.904

Roughly 90.4 percent of the samples are 1, and the rest are 0.
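The mean only works as a class proportion because the labels are 0 and 1. As a minimal sketch of a more general check, the per-class counts can be read off with `np.bincount`. The `random_state=0` argument and the names `class_counts` and `class_props` are assumptions added here for repeatability and illustration; the recipe itself does not set a seed, so its exact figures may differ.

```python
import numpy as np
from sklearn import datasets

# random_state is an assumption added for repeatability; the recipe does not set one.
X, y = datasets.make_classification(n_samples=1000, weights=[1. / 11], random_state=0)

class_counts = np.bincount(y)                     # number of samples per label (index = label)
class_props = class_counts / class_counts.sum()   # share of each label in the dataset

for label, (count, prop) in enumerate(zip(class_counts, class_props)):
    print(f"class {label}: {count} samples ({prop:.1%})")
```

For binary 0/1 labels, `class_props[1]` is the same number as `y.mean()`.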

How to do it...

Let's create a stratified k-fold object and iterate through each fold. We'll measure the proportion of samples that are 1 in each training set. After that, we'll plot the class proportion by split number to see whether and how it changes. This code should illustrate why stratification is beneficial. We'll also compare against a basic ShuffleSplit:

from sklearn.model_selection import cross_val_score, StratifiedKFold, ShuffleSplit
n_folds = 50
strat_kfold = StratifiedKFold(n_splits=n_folds)
shuff_split = ShuffleSplit(n_splits=n_folds)
kfold_y_props = []
shuff_y_props = []
for (k_train, k_test), (s_train, s_test) in zip(strat_kfold.split(X, y),shuff_split.split(X, y)):
    kfold_y_props.append(y[k_train].mean())
    shuff_y_props.append(y[s_train].mean())

Now, let's plot the proportions over each fold:

import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
ax.plot(range(n_folds), shuff_y_props, label="ShuffleSplit",color='k')
ax.plot(range(n_folds), kfold_y_props, label="Stratified",color='k', ls='--')
ax.set_title("Comparing class proportions.")
ax.legend(loc='best')

In the resulting plot, we can see that the class proportion for stratified k-fold is stable across folds, while the ShuffleSplit proportion moves around from split to split.
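The same comparison can also be made numerically. The following is a minimal sketch, not part of the original recipe, that summarizes the spread (maximum minus minimum) of the training-set class proportion for each splitter. The `random_state=0` arguments and the names `splitters` and `spreads` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, ShuffleSplit

# Seeds are assumptions added for repeatability; the recipe does not set them.
X, y = datasets.make_classification(n_samples=1000, weights=[1. / 11], random_state=0)

n_folds = 50
splitters = {
    "Stratified": StratifiedKFold(n_splits=n_folds),
    "ShuffleSplit": ShuffleSplit(n_splits=n_folds, random_state=0),
}

spreads = {}
for name, splitter in splitters.items():
    # Proportion of samples labeled 1 in each training set.
    props = np.array([y[train].mean() for train, _ in splitter.split(X, y)])
    spreads[name] = props.max() - props.min()
    print(f"{name}: min={props.min():.4f} max={props.max():.4f} spread={spreads[name]:.4f}")
```

A smaller spread means the class balance seen during training changes less from one split to the next.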

How it works...

Stratified k-fold works from the y values. It first computes the overall proportion of each class, then splits the data so that the training and test sets each preserve those proportions. This generalizes to more than two classes:

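import numpy as np
from sklearn.model_selection import StratifiedKFold

three_classes = np.random.choice([1, 2, 3], p=[.1, .4, .5], size=1000)
# split() requires an X argument, but only y drives the stratification,
# so an array of zeros with the right length serves as a placeholder.
placeholder_X = np.zeros(len(three_classes))
for train, test in StratifiedKFold(n_splits=5).split(placeholder_X, three_classes):
    print(np.bincount(three_classes[train]))
# Output from the original recipe's run; the draw is unseeded, so counts differ between runs:
[ 0 90 314 395]
[ 0 90 314 395]
[ 0 90 314 395]
[ 0 91 315 395]
[ 0 91 315 396]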

As we can see, every training set contains roughly the same number of samples of each class, in line with the overall class proportions.
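To make the idea concrete, here is a minimal sketch of one way stratified folds could be built by hand: split each class's indices into k nearly equal chunks, then assemble fold i from the i-th chunk of every class. This is an illustration of the principle only, not scikit-learn's actual implementation, and `manual_stratified_folds` is a hypothetical helper name. The seed is also an assumption added for repeatability.

```python
import numpy as np

def manual_stratified_folds(y, n_splits):
    """Hypothetical helper: yield (train, test) index arrays, stratified by class."""
    y = np.asarray(y)
    test_parts = [[] for _ in range(n_splits)]
    for label in np.unique(y):
        label_idx = np.flatnonzero(y == label)
        # Chunks of one class differ in size by at most one sample.
        for fold, chunk in enumerate(np.array_split(label_idx, n_splits)):
            test_parts[fold].append(chunk)
    all_idx = np.arange(len(y))
    for parts in test_parts:
        test = np.sort(np.concatenate(parts))
        train = np.setdiff1d(all_idx, test)   # everything not in the test fold
        yield train, test

rng = np.random.default_rng(0)  # seed is an assumption, for repeatability
labels = rng.choice([1, 2, 3], p=[.1, .4, .5], size=1000)
folds = list(manual_stratified_folds(labels, 5))
for train, test in folds:
    # Per-class counts in the test fold (index 0 dropped because no label is 0).
    print(np.bincount(labels[test], minlength=4)[1:])
```

Because each class is divided separately, every test fold holds about one fifth of each class, and the training sets inherit the same balance.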

This article is a translation of a foreign-language work.

In case of infringement, please contact cloudcommunity@tencent.com for removal.
