Grid search is one of the most common methods for hyperparameter tuning, so it is worth a brief introduction.
In machine learning algorithms there is a class of parameters that must be set by hand before training; we call them "hyperparameters" — for example the learning rate, the regularization coefficient, or the depth of a decision tree.
Grid search aims to find the parameter values that make the model perform best. Its principle is simply brute-force search: we specify a set of candidate values for each parameter in advance, enumerate every combination of them, and keep the combination that works best.
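The enumeration idea can be sketched in a few lines before involving any model: the candidate values below are illustrative, and `itertools.product` is just one way to form the Cartesian product of the per-parameter candidate lists.

```python
import itertools

# Candidate values for each hyperparameter (illustrative choices)
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
}

# Enumerate every combination: the Cartesian product of the value lists
keys = list(param_grid)
combos = [dict(zip(keys, values))
          for values in itertools.product(*param_grid.values())]

print(len(combos))  # 6 * 6 = 36 combinations to evaluate
```

Each dictionary in `combos` is one candidate setting; grid search trains and scores one model per entry, which is why the cost grows multiplicatively with each added parameter.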
1. Brute-force search with two nested for loops:
The grid search below finds the best parameter values among the candidates: C = 100 and gamma = 0.001.
# naive grid search implementation
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
print("Size of training set: %d  size of test set: %d"
      % (X_train.shape[0], X_test.shape[0]))

best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters, train an SVC
        svm = SVC(gamma=gamma, C=C)
        svm.fit(X_train, y_train)
        # evaluate the SVC on the test set
        score = svm.score(X_test, y_test)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

print("best score: ", best_score)
print("best parameters: ", best_parameters)
output:
Size of training set: 112 size of test set: 38
best score: 0.973684210526
best parameters: {'C': 100, 'gamma': 0.001}
2. Brute-force search with a parameter dictionary:
Here the grid search finds the best parameters: C = 0.1 with a linear kernel.
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# X_train, y_train come from the train_test_split above
pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('clf', SVC(random_state=1))])
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]
gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10,
                  n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
output:
0.978021978022
{'clf__C': 0.1, 'clf__kernel': 'linear'}
The param_grid argument of GridSearchCV is a list of dictionaries. For the linear SVM we tune only the parameter C; for the RBF-kernel SVM we tune both C and gamma. Finally, best_params_ gives the best parameter combination.
Next, we build a model directly from the best parameters (best_estimator_):
clf = gs.best_estimator_
clf.fit(X_train, y_train)
print('Test accuracy: %.3f' % clf.score(X_test, y_test))
Grid search works well, but exhaustive enumeration is time-consuming. scikit-learn therefore also implements random search via the RandomizedSearchCV class, which samples random parameter combinations instead of trying them all.
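As a minimal sketch of random search on the same iris data: the log-uniform sampling ranges for C and gamma below are assumptions chosen for illustration, not values from the grid searches above.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Sample 20 random (C, gamma) pairs instead of evaluating a full grid;
# the distributions here are illustrative assumptions
param_dist = {'C': loguniform(1e-3, 1e3),
              'gamma': loguniform(1e-4, 1e1)}
rs = RandomizedSearchCV(SVC(), param_distributions=param_dist,
                        n_iter=20, cv=5, random_state=0, n_jobs=-1)
rs.fit(X_train, y_train)

print(rs.best_params_)
print('Test accuracy: %.3f' % rs.score(X_test, y_test))
```

With n_iter fixed, the cost of random search stays constant no matter how many parameters are searched, which is its main advantage over the exhaustive grid.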