We train our models hoping for more accurate results, but practical constraints (such as compute and time) often force us to choose a splitting strategy. This article introduces several common dataset-splitting and cross-validation strategies along with their pros and cons, mainly covering the train-test split, k-fold cross-validation, and leave-one-out cross-validation, including code-level implementations and a comparison of their results. It is well worth a thorough read.
Model evaluation is a set of procedures allowing you to pick the best possible stable model. It is an essential part of the model development process. It reveals the model’s behavior and its predictive power — indicates the balance between bias and variance on unseen data. As a starting point, split the given dataset into a train and test set. The model will learn to predict using the train set; in comparison, we will utilize the test set to assess the model’s performance.
There are different strategies to split the data and make sure that it is done fairly taking into consideration the special characteristics the attributes could have. For example, you could have biased predictions if the original data has an imbalance between features, so for each case, a specific method might be recommended.
The main methods covered in this article are the following:

- Train-test split
- K-fold cross-validation
- Leave-one-out cross-validation (LOOCV)
regular train-test split using sklearn — image by the author
It is a way to split the dataset into two parts according to a given percentage. It is easy and quick, and it might be appropriate when comparing different algorithms to decide which one to consider.
The train_test_split method within the sklearn.model_selection module is widely used to split the original dataset. A common split ratio is 80/20 or 70/30.
You can further split the training set into train and validation sets with the same split ratio as above (see the Stackoverflow discussion). I used stratify here because the original dataset has an imbalance in the target class (500/268).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# set the seed for reproducibility
seed = 7

# apply the splitting
x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,
    stratify=y,  # to keep class balance during splitting
    random_state=seed
)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
result = model.score(x_test, y_test)
print(f'accuracy is: {result * 100.0:0.3f}')
The resulting accuracy is: 76.378
k-fold split procedure — image by the author
To improve the accuracy estimate and avoid the disadvantages of a single train-test split, we need to add more generalization to the splitting process. In this strategy, we repeat the train-test split multiple times at random. For each split, or fold, the accuracy is calculated; the algorithm then aggregates the accuracies from all splits and averages them. That way, every point in the dataset is involved in measuring the model accuracy, which is better.
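The repeat-and-average idea above can be sketched with a plain stratified k-fold loop. This is a minimal sketch, not the article's exact code: it uses a synthetic dataset from `make_classification` as a stand-in for the article's data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# synthetic stand-in for the article's dataset
X, y = make_classification(n_samples=200, random_state=7)
model = LogisticRegression(max_iter=1000)

# train and score on each fold, then average the per-fold accuracies
accuracies = []
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
for train_idx, test_idx in kfold.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

mean_acc = np.mean(accuracies)
print(f'mean accuracy over {len(accuracies)} folds: {mean_acc:.3f}')
```

Each data point lands in the test fold exactly once, so the averaged score uses the whole dataset.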
For this example, we will use the RepeatedStratifiedKFold() within the sklearn library to assess the model since it repeats stratified folds n-times with a different random scheme in each iteration.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from scipy.stats import sem
import numpy as np
import matplotlib.pyplot as plt

cv_repeated = RepeatedStratifiedKFold(
    n_splits=10,
    n_repeats=16,
    random_state=seed
)

scores_r = cross_val_score(
    model,
    X, y,
    scoring='accuracy',
    cv=cv_repeated,
    n_jobs=-1
)

print('Accuracy: %.3f (%.3f)' % (scores_r.mean(), scores_r.std()))
The resulting accuracy is: 0.775 (0.042)
It is a good idea to investigate the distribution of the estimates further for better judgment.
# evaluate a model with a given number of repeats
def assess_model(X, y, repeats):
    # define the cv folds
    cv = RepeatedStratifiedKFold(
        n_splits=10,
        n_repeats=repeats,
        random_state=seed)
    # create the model
    model = LogisticRegression(max_iter=1000)
    # evaluate the model
    scores = cross_val_score(
        model,
        X, y,
        scoring='accuracy',
        cv=cv,
        n_jobs=-1)
    return scores
Then we will use the sem() method from the scipy library to calculate the standard error of each sample.
repeats = range(1, 16)
res = list()
for rep in repeats:
    scores = assess_model(X, y, rep)
    print('Sample_%d mean=%.4f se=%.3f' % (rep, np.mean(scores), sem(scores)))
    res.append(scores)
Let’s visualize the samples’ accuracies with a boxplot to better understand the results.
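A minimal sketch of such a boxplot follows. Here `res` is filled with synthetic score arrays so the snippet runs standalone; in the article it holds the outputs of the evaluation loop above. `showmeans=True` draws the mean marker alongside the median line.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line when running locally
import matplotlib.pyplot as plt

# synthetic stand-in for the per-repeat score arrays collected above
rng = np.random.default_rng(7)
repeats = range(1, 16)
res = [rng.normal(0.775, 0.04, size=10 * r) for r in repeats]

# one box per number of repeats; the triangle marks the mean
fig, ax = plt.subplots()
ax.boxplot(res, showmeans=True)
ax.set_xticks(range(1, 16))
ax.set_xticklabels([str(r) for r in repeats])
ax.set_xlabel('number of repeats')
ax.set_ylabel('accuracy')
fig.savefig('accuracy_boxplot.png')
```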
accuracy across splits — image by the author
The orange line represents the median of the distribution of the accuracy while the green triangle indicates the arithmetic mean.
As demonstrated in the graph above, the model accuracy stabilizes at around 6 to 7 repeats (0.775 (0.042) accuracy), which is the number of repeats to use.
leave one out cross-validation — image by the author
In this strategy, the algorithm picks a single data point and holds it out while training the model; the held-out point is then used to measure the model's accuracy. This process is repeated for every data point, and the final accuracy is the average over all iterations.
In this strategy, we create n models for n observations in the data, each trained on n - 1 observations.
from sklearn.model_selection import LeaveOneOut
loocv = LeaveOneOut()
model = LogisticRegression(max_iter=1000)
res = cross_val_score(model, X, y, cv = loocv)
print('Accuracy: %.3f (%.3f)' % (np.mean(res), np.std(res)))
The resulting accuracy is: 0.776 (0.417)
The best way to test whether to use LOOCV or not is to run KFold-CV with a large k value — consider 25 or 50, and gauge how long it would take to train the model.
We explored the most common strategies for evaluating a machine learning model effectively. Each method has its pros and cons; however, there are some tips we may consider when choosing one.