[012] Implementing and Comparing Different Dataset Splitting and Validation Methods

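When we train a model we want the most accurate results possible, but practical constraints such as compute and time usually force us to pick a strategy. This article introduces several common dataset splitting and cross-validation strategies along with their pros and cons, mainly train-test split, k-fold cross-validation, and leave-one-out cross-validation. It covers both the code-level implementation and a comparison of the results, and is worth reading through once as a whole.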

What is Model evaluation?

Model evaluation is a set of procedures that lets you pick the best possible stable model. It is an essential part of the model development process. It reveals the model’s behavior and predictive power, and indicates how well the model balances bias and variance on unseen data. As a starting point, split the given dataset into a train set and a test set. The model learns to predict from the train set, while the test set is used to assess the model’s performance.

Methods used for splitting

There are different strategies for splitting the data fairly, taking into account any special characteristics of its attributes. For example, predictions can be biased if the original data is imbalanced (for instance, in the target classes), so a specific method may be recommended for each case.

The main methods covered in this article are the following:

1. Train-test split
2. k-fold cross-validation, K-Fold
3. Leave One Out Cross-validation, LOOCV

Train test split

regular train-test split using sklearn — image by the author

It is a way to split the dataset into two parts according to a specified percentage. It is easy and quick, and it can be appropriate when comparing different algorithms to decide which one to consider further.

The train_test_split function in the sklearn.model_selection module is widely used to split the original dataset. Common split ratios are 80/20 and 70/30.

You can further split the training set into train and validation sets with the same split ratio as above (this comes from a Stack Overflow discussion). I used stratify here because the original dataset has an imbalance in the target class: 500/268.

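```
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# set the seed to keep the code reproducible
seed = 7

# apply the splitting (X, y: feature matrix and target, assumed to be loaded already)
x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,
    stratify=y,  # keep the class balance during splitting
    random_state=seed,
)

# fit on the train set, score on the held-out test set
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
result = model.score(x_test, y_test)
print(f'accuracy is: {result*100.0:.3f}')
```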

The resulting accuracy is 76.378%.
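The train/validation idea mentioned above can be sketched by calling train_test_split twice. This is only an illustration under stated assumptions: `X_demo` and `y_demo` are hypothetical stand-in arrays that mimic the 500/268 class imbalance, not the article's dataset, and reusing the 0.33 ratio for the second cut is one possible choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 768 rows with a 500/268 class imbalance.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(768, 4))
y_demo = np.array([0] * 500 + [1] * 268)

# First cut: hold out the test set, stratified on the target.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=7
)

# Second cut: split the remainder into train and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.33, stratify=y_tmp, random_state=7
)

# With stratify, each part keeps roughly the original share of positives.
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, positive share={part.mean():.3f}")
```

The model would then be tuned against the validation part, with the test part kept aside for the final check.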

Pros:

• Easy to implement
• Quick execution, less computation time

Cons:

• Misleading accuracy estimate if the split is not random
• Might cause underfitting if the original dataset has few data points.

K-fold cross-validation

k-fold split procedure — image by the author

To get a more reliable accuracy estimate and avoid the disadvantages of a single split, we need to make the split process more general. In this strategy, the dataset is divided into k folds and the train/test procedure is repeated k times, each time holding out a different fold for testing and training on the rest. The accuracy is calculated for each fold, and the per-fold accuracies are then averaged. That way, every data point takes part in measuring the model accuracy, which gives a better estimate.
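To make the fold mechanics concrete, here is a minimal pure-Python sketch of how k-fold index splits can be generated. The helper `k_fold_indices` is a hypothetical name introduced for illustration; it is not sklearn's implementation, and it shows the index bookkeeping only, without any model.

```python
import random

def k_fold_indices(n_samples, k, seed=7):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training indices are all folds except the one held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_no}: train={sorted(train_idx)} test={sorted(test_idx)}")

# Every index lands in exactly one test fold.
tested = sorted(j for _, test_idx in splits for j in test_idx)
print(tested == list(range(10)))
```

Each data point is used for testing exactly once and for training k-1 times, which is what lets every point contribute to the averaged accuracy.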

In the example below, we use RepeatedStratifiedKFold from the sklearn library to assess the model, since it repeats stratified k-fold n times with a different randomization in each repetition.

```
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from scipy.stats import sem
import numpy as np
import matplotlib.pyplot as plt

# 10 stratified folds, repeated 16 times with a different shuffle each time
cv_repeated = RepeatedStratifiedKFold(
    n_splits=10,
    n_repeats=16,
    random_state=seed,
)

# one accuracy score per fold and repeat (model, X, y and seed come from the snippet above)
scores_r = cross_val_score(
    model,
    X, y,
    scoring='accuracy',
    cv=cv_repeated,
    n_jobs=-1,
)

print('Accuracy: %.3f (%.3f)' % (scores_r.mean(), scores_r.std()))
```

The resulting accuracy, reported as mean (standard deviation), is 0.775 (0.042).

Assessing the model accuracy across folds

It is a good idea to look more closely at the distribution of the estimates in order to make a better judgment.

```
# evaluate a model with a given number of repeats
def asses_model(X, y, repeats):
    # define the cv folds
    cv = RepeatedStratifiedKFold(
        n_splits=10,
        n_repeats=repeats,
        random_state=seed)
    # create the model
    model = LogisticRegression(max_iter=1000)
    # evaluate model
    scores = cross_val_score(
        model,
        X, y,
        scoring='accuracy',
        cv=cv,
        n_jobs=-1)
    return scores
```

Then we will use the `sem()` method from the `scipy` library to calculate the standard error for each sample.

```
repeats = range(1, 16)
res = list()
for rep in repeats:
    scores = asses_model(X, y, rep)
    print('Sample_%d mean=%.4f se=%.3f' % (rep, np.mean(scores), sem(scores)))
    res.append(scores)
```
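For reference, the standard error printed in the loop above is the sample standard deviation divided by the square root of the number of scores; `scipy.stats.sem` uses `ddof=1` by default. The sketch below reproduces that formula with the standard library only. The name `standard_error` and the `demo_scores` values are hypothetical and serve only to exercise the formula.

```python
import math
import statistics

def standard_error(scores):
    """Standard error of the mean: sample std (ddof=1) divided by sqrt(n)."""
    return statistics.stdev(scores) / math.sqrt(len(scores))

# Hypothetical fold accuracies, not results from the article.
demo_scores = [0.74, 0.78, 0.80, 0.76, 0.77]
print(f"mean={statistics.mean(demo_scores):.4f} se={standard_error(demo_scores):.4f}")
```

Because the denominator grows with the number of scores, the standard error shrinks as more repeats are added, which is why the estimates settle down as the number of repeats increases.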

Let’s visualize the sample accuracies with a boxplot to better understand the results.

accuracy across splits — image by the author

The orange line represents the median of the distribution of the accuracy while the green triangle indicates the arithmetic mean.
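The plotting code is not shown in the article, so the following is a sketch of one way to draw such a boxplot with matplotlib, not necessarily the author's code. `plot_score_distributions` is a hypothetical helper, and `demo_res` holds random placeholder numbers rather than real model scores; in practice the `res` and `repeats` objects built in the loop above would be passed instead.

```python
import random
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

def plot_score_distributions(res, repeats):
    """Draw one box per sample of fold accuracies; showmeans adds the mean marker."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.boxplot(res, showmeans=True)
    ax.set_xticks(range(1, len(res) + 1))
    ax.set_xticklabels([str(r) for r in repeats])
    ax.set_xlabel("number of repeats")
    ax.set_ylabel("accuracy")
    return fig, ax

# Placeholder scores only (random numbers, not the article's results).
rng = random.Random(7)
demo_repeats = range(1, 16)
demo_res = [[rng.gauss(0.775, 0.04) for _ in range(10 * rep)] for rep in demo_repeats]

fig, ax = plot_score_distributions(demo_res, demo_repeats)
```

With matplotlib's default style, the median is drawn as an orange line and `showmeans=True` adds a green triangle for the mean. Call `fig.savefig(...)` or `plt.show()` to view the figure.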

As the graph above shows, the model accuracy stabilizes at around 6 to 7 repeats, which is therefore the number of repeats to use (an accuracy of 0.775 (0.042)).

Pros:

• More reliable accuracy estimate
• Handles class imbalances better.
• Lower probability of underfitting

Cons:

• More prone to overfitting, so we need to monitor the accuracies across folds.
• High computational power and more execution time.

Leave-One-Out Cross-validation

leave one out cross-validation — image by the author

In this strategy, the algorithm holds out a single data point in each fold and trains the model on all the remaining points. The held-out point serves as the validation set used to calculate the model accuracy, and the process is repeated for every data point. The final accuracy is the average of the per-fold accuracies.

In this strategy, we create n models for n observations in the data, each trained on n-1 of them.
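The leave-one-out loop can be sketched in plain Python: one model per observation, each fitted on the remaining points. The toy 1-nearest-neighbour classifier and the six data points below are hypothetical and chosen only to keep the example small; they stand in for the logistic regression and dataset used in the article.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Toy 1-nearest-neighbour classifier on 1-D inputs."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]

# Hypothetical toy data: six 1-D points with binary labels.
xs = [1.0, 1.2, 1.9, 3.1, 3.3, 2.2]
ys = [0, 0, 0, 1, 1, 1]

fold_scores = []
for i in range(len(xs)):
    # Hold out point i and "train" on everything else.
    train_x = xs[:i] + xs[i + 1:]
    train_y = ys[:i] + ys[i + 1:]
    pred = nearest_neighbor_predict(train_x, train_y, xs[i])
    # Each fold has a single test point, so its accuracy is either 0 or 1.
    fold_scores.append(1.0 if pred == ys[i] else 0.0)

print(len(fold_scores), "models evaluated")
print("LOOCV accuracy:", sum(fold_scores) / len(fold_scores))
```

Note that every per-fold score is either 0 or 1, since each fold tests exactly one point.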

```
from sklearn.model_selection import LeaveOneOut, cross_val_score

# one fold per observation: each model is tested on a single held-out point
loocv = LeaveOneOut()
model = LogisticRegression(max_iter=1000)
res = cross_val_score(model, X, y, cv=loocv)
print('Accuracy: %.3f (%.3f)' % (np.mean(res), np.std(res)))
```

The resulting accuracy, reported as mean (standard deviation), is 0.776 (0.417).
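The spread of 0.417 is much larger than in the k-fold run, and a likely explanation is that each LOOCV fold scores a single point, so every per-fold accuracy is either 0 or 1. For such 0/1 scores with mean p, the population standard deviation (what `np.std` computes by default) is sqrt(p(1-p)). A quick check with the reported mean:

```python
import math

mean_accuracy = 0.776  # LOOCV mean accuracy reported above

# Standard deviation of 0/1 scores whose mean is p: sqrt(p * (1 - p)).
std_of_binary_scores = math.sqrt(mean_accuracy * (1 - mean_accuracy))
print(f"{std_of_binary_scores:.3f}")
```

The value matches the reported 0.417, so the number reflects the binary nature of the per-fold scores and should not be compared directly with the k-fold standard deviation.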

Pros:

• Makes very efficient use of the data when the dataset is small, since we want to use as much training data as possible when fitting the model.
• It has the best error estimate possible for a single new data point.

Cons:

• Computationally expensive, especially if the dataset is large or if many different parameter sets are being tested.

A practical way to decide whether to use LOOCV is to run k-fold CV with a large k value (consider 25 or 50) and gauge how long it takes to train the model.
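A minimal sketch of that timing check follows. It assumes sklearn is available; `time_k_fold` is a hypothetical helper, and `X_demo`/`y_demo` are synthetic stand-ins to be replaced with the real data.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

def time_k_fold(model, X, y, k):
    """Return (mean accuracy, elapsed seconds) for one k-fold run."""
    cv = KFold(n_splits=k, shuffle=True, random_state=7)
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean(), time.perf_counter() - start

# Hypothetical stand-in data; substitute the real X and y.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

timings = {}
for k in (25, 50):
    acc, elapsed = time_k_fold(LogisticRegression(max_iter=1000), X_demo, y_demo, k)
    timings[k] = elapsed
    print(f"k={k}: accuracy={acc:.3f}, {elapsed:.2f}s")
```

Since LOOCV fits one model per observation, its cost can be expected to grow roughly in proportion to the number of folds, so the k=25 and k=50 timings give a rough idea of what n folds would cost.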

Takeaways and Closing notes

We explored the most common strategies for splitting data and validating a machine learning model effectively. Each method has its pros and cons; however, there are some tips to consider when choosing one.

1. K-fold cross-validation is the rule-of-thumb choice for comparing the performance of different algorithms; the most common values of k are 3, 5, and 10.
2. Start with the regular train-test split to get a baseline for a specific algorithm’s estimated performance.
3. Leave-one-out cross-validation (LOOCV) gives a deterministic estimate, since there is no random sampling of the training dataset. The other strategies, by contrast, produce stochastic estimates.
4. LOOCV might be appropriate when you need an accurate estimate of the performance.
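The deterministic-versus-stochastic point can be illustrated by comparing the folds produced by sklearn's LeaveOneOut and a shuffled KFold. This is a sketch on hypothetical toy data; `collect_test_folds` is a helper name introduced here.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(20).reshape(10, 2)  # hypothetical toy data, 10 rows

def collect_test_folds(splitter, X):
    """Collect the test indices of every fold as a list of tuples."""
    return [tuple(test_idx) for _, test_idx in splitter.split(X)]

# LOOCV has no random component: the folds are the same on every run.
loo_a = collect_test_folds(LeaveOneOut(), X_demo)
loo_b = collect_test_folds(LeaveOneOut(), X_demo)
print("LOOCV identical:", loo_a == loo_b)

# Shuffled k-fold depends on the seed, so different seeds generally give different folds.
kf_a = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=1), X_demo)
kf_b = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=2), X_demo)
print("k-fold identical across seeds:", kf_a == kf_b)
```

With a fixed `random_state` the shuffled k-fold split is reproducible too, but the estimate still depends on which seed was chosen, whereas LOOCV has no such dependence.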
