In scikit-learn 0.24.0 or later, when you use GridSearchCV or RandomizedSearchCV with n_jobs=-1, setting verbose to any number (1, 2, 3, or 100) prints no progress messages at all. With scikit-learn 0.23.2 or earlier, everything works as expected and the progress messages are printed.
Here is example code you can use to reproduce my experiment in Google Colab or a Jupyter notebook:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[0.1, 1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, scoring='accuracy', refit=True, n_jobs=-1, verbose=60)
clf.fit(iris.data, iris.target)
print('Best accuracy score: %.2f' %clf.best_score_)
Result with scikit-learn 0.23.2:
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0295s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done 2 out of 30 | elapsed: 0.0s remaining: 0.5s
[Parallel(n_jobs=-1)]: Done 3 out of 30 | elapsed: 0.0s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 4 out of 30 | elapsed: 0.0s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 5 out of 30 | elapsed: 0.0s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 6 out of 30 | elapsed: 0.0s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 8 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 9 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 10 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 11 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 12 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 13 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 14 out of 30 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 15 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 16 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 17 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 18 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 19 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 20 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 21 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 22 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 23 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 24 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 25 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 26 out of 30 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 27 out of 30 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 28 out of 30 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 30 out of 30 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 30 out of 30 | elapsed: 0.1s finished
Best accuracy score: 0.98
Result with scikit-learn 0.24.0 (tested up to v1.0.2):
Fitting 5 folds for each of 6 candidates, totaling 30 fits
Best accuracy score: 0.98
It looks to me as if scikit-learn 0.24.0 and later do not pass the verbose value on to joblib, so when the multiprocessing uses the "loky" backend in GridSearchCV or RandomizedSearchCV, no progress is printed.
Do you know how to work around this in Google Colab or Jupyter notebooks? How can the progress logs be printed with sklearn 0.24.0 or later?
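One observation of my own (not something I have seen confirmed in the docs): the missing messages seem tied to the worker processes, because when the search runs entirely in the main process the output still appears in the notebook cell. A minimal check, using the same search as above but with n_jobs=1:

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 10]}

# Same search as above, but single-process: with n_jobs=1 no loky workers
# are spawned, so the verbose per-fit messages print in the main process.
clf = GridSearchCV(svm.SVC(), parameters, scoring='accuracy',
                   refit=True, n_jobs=1, verbose=3)
clf.fit(iris.data, iris.target)
print('Best accuracy score: %.2f' % clf.best_score_)
```

That of course gives up the parallelism, so it is only a diagnostic, not a fix.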
Posted on 2022-06-20 23:58:54
Here is a roundabout way to get GridSearchCV-like behavior with progress printing in Google Colab. It would need some adaptation to mimic RandomizedSearchCV.
This approach requires creating training, validation, and test sets. We use the validation set to evaluate the candidate models and reserve the test set for the final best model.
import gc
import sys  # needed for tqdm(file=sys.stdout) below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm
from sklearn.neighbors import KernelDensity
from scipy import stats
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, ParameterGrid
# This is based on the target and features from my dataset
y = relationships["tmrca"]
X = relationships.drop(columns = ["sample1", "sample2", "total_span_cM", "max_span_cM", "relationship", "tmrca"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print(f"X_train size: {len(X_train):,} \nX_validation size: {len(X_validation):,} \nX_test size: {len(X_test):,}")
Here we define the method.
def random_forest_tvt(para_grid, seed):
    # Grid search over hyperparameters such as n_estimators, max_leaf_nodes, etc.
    # Fit each model on the training set, tune the parameters on the validation
    # set, and keep the best parameters.
    error_min = 1
    count = 0
    clf = RandomForestClassifier(n_jobs=-1, random_state=seed)
    num_fits = len(ParameterGrid(para_grid))
    with tqdm(total=num_fits, desc="Trying the models for the best fit...", file=sys.stdout) as fit_pbar:
        for g in ParameterGrid(para_grid):
            count += 1
            print(f"\n{g}")
            clf.set_params(**g)
            clf.fit(X_train, y_train)
            y_predict_validation = clf.predict(X_validation)
            accuracy_measure = accuracy_score(y_validation, y_predict_validation)
            error_validation = 1 - accuracy_measure
            print(f"The accuracy is {accuracy_measure * 100:.2f}%.\n")
            if error_validation < error_min:
                error_min = error_validation
                best_para = g
            fit_pbar.update()
    # Refit the model on the best parameters for the method's output
    clf.set_params(**best_para)
    clf.fit(X_train, y_train)
    y_predict_train = clf.predict(X_train)
    score_train = accuracy_score(y_train, y_predict_train)
    y_predict_validation = clf.predict(X_validation)
    score_validation = accuracy_score(y_validation, y_predict_validation)
    return best_para, score_train, score_validation
Then define the parameter grid and call the method.
seed = 0
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1000, stop = 5000, num = 3)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 3)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True]
# Parameter Grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(f"The parameter grid\n{random_grid}\n")
best_parameters, score_train, score_validation = random_forest_tvt(random_grid, seed)
print(f"\n === Random Forest ===\n Best parameters are: {best_parameters} \n training score: {score_train * 100:.2f}%, validation score: {score_validation * 100:.2f}%.")
Here is the output in Google Colab through the fifth fit, while the method is still running.
The parameter grid
{'n_estimators': [1000, 3000, 5000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 60, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True]}
Trying the models for the best fit...: 0%| | 0/216 [00:00<?, ?it/s]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 1000}
The accuracy is 85.13%.
Trying the models for the best fit...: 0%| | 1/216 [00:16<58:27, 16.31s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 3000}
The accuracy is 85.13%.
Trying the models for the best fit...: 1%| | 2/216 [01:05<2:06:44, 35.53s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 5000}
The accuracy is 85.10%.
Trying the models for the best fit...: 1%|▏ | 3/216 [02:40<3:42:34, 62.70s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 1000}
The accuracy is 85.15%.
Trying the models for the best fit...: 2%|▏ | 4/216 [02:56<2:36:00, 44.15s/it]
{'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 3000}
The accuracy is 85.14%.
Trying the models for the best fit...: 2%|▏ | 5/216 [03:43<2:39:13, 45.28s/it]
You can then use best_parameters for further fine-tuning or call the predict method on the test set.
best_grid = RandomForestClassifier(n_jobs=-1, random_state=seed)
best_grid.set_params(**best_parameters)
best_grid.fit(X_train, y_train)
y_predict_test = best_grid.predict(X_test)
score_test = accuracy_score(y_test, y_predict_test)
print(f"{score_test * 100:.2f}%")
You would need further adjustments to get k-fold behavior. As it stands, each model is scored once on the training set and once on the validation set, so two evaluations per model in total. The model with the best parameters is then fit a third time to produce the output. Finally, you can use the returned parameters for further fine-tuning (not shown here) or call the predict method on the test set.
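As a sketch of that k-fold adjustment (my own variation, not part of the tested code above): replace the single validation-set score inside the loop with a `cross_val_score` mean, so each candidate is scored across k folds while tqdm still reports progress. The iris data here is a stand-in for your own X_train / y_train:

```python
import sys
import numpy as np
from tqdm import tqdm
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

# Stand-in data; substitute your own X_train / y_train.
X_train, y_train = load_iris(return_X_y=True)

para_grid = {'n_estimators': [10, 30], 'max_depth': [5, None]}
best_para, best_score = None, -np.inf

with tqdm(total=len(ParameterGrid(para_grid)), file=sys.stdout) as pbar:
    for g in ParameterGrid(para_grid):
        clf = RandomForestClassifier(random_state=0, n_jobs=-1, **g)
        # A 5-fold cross-validated mean replaces the single validation-set score.
        score = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy').mean()
        if score > best_score:
            best_score, best_para = score, g
        pbar.update()

print(f"Best parameters: {best_para}, CV accuracy: {best_score * 100:.2f}%")
```

With k-fold scoring the separate validation split is no longer strictly needed, though you should still hold out a test set for the final model.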
https://stackoverflow.com/questions/70745877