
Machine Learning in Action | Chapter 4: Model Validation and Selection


Model selection and evaluation live mainly in the sklearn.model_selection module. This chapter only gives an overview and the usage of the most common functions; for more detail, see sklearn.model_selection: Model Selection (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection).

Overview

Splitter Classes

model_selection.KFold([n_splits, shuffle, …]): K-Folds cross-validator
model_selection.GroupKFold([n_splits]): K-fold iterator variant with non-overlapping groups
model_selection.StratifiedKFold([n_splits, …]): Stratified K-Folds cross-validator
model_selection.LeaveOneGroupOut(): Leave One Group Out cross-validator
model_selection.LeavePGroupsOut(n_groups): Leave P Group(s) Out cross-validator
model_selection.LeaveOneOut(): Leave-One-Out cross-validator
model_selection.LeavePOut(p): Leave-P-Out cross-validator
model_selection.ShuffleSplit([n_splits, …]): Random permutation cross-validator
model_selection.GroupShuffleSplit([…]): Shuffle-Group(s)-Out cross-validation iterator
model_selection.StratifiedShuffleSplit([…]): Stratified ShuffleSplit cross-validator
model_selection.PredefinedSplit(test_fold): Predefined split cross-validator
model_selection.TimeSeriesSplit([n_splits]): Time Series cross-validator
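
As a quick illustration of how these splitter classes are used, here is a minimal KFold sketch (the toy data and the parameter choices are made up for illustration). Each splitter exposes a split() generator that yields train/test index arrays:

import numpy as np
from sklearn.model_selection import KFold

# toy data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # each iteration yields the indices of one train/test split
    print(train_idx.shape, test_idx.shape)   # (8,) (2,)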

Split Functions

model_selection.train_test_split(*arrays, …): Randomly split arrays or matrices into train and test subsets
model_selection.check_cv([cv, y, classifier]): Input checker utility for building a cross-validator

Hyperparameter Optimizers

model_selection.GridSearchCV(estimator, …): Exhaustive search over specified parameter values for an estimator
model_selection.RandomizedSearchCV(…[, …]): Randomized search on hyper parameters
model_selection.ParameterGrid(param_grid): Grid of parameters with a discrete number of values for each
model_selection.ParameterSampler(…[, …]): Generator on parameters sampled from given distributions
model_selection.fit_grid_point(X, y, …[, …]): Run fit on one set of parameters
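
As a rough sketch of how these optimizers are used (the iris data, the SVC estimator and the parameter grid here are just illustrative assumptions), GridSearchCV tries every combination in a parameter grid with cross-validation and records the best one:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()

# exhaustively try every C/kernel combination with 5-fold cross-validation
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)  # best combination found
print(search.best_score_)   # its mean cross-validated score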

Model validation

model_selection.cross_val_score(estimator, X): Evaluate a score by cross-validation
model_selection.cross_val_predict(estimator, X): Generate cross-validated estimates for each input data point
model_selection.permutation_test_score(…): Evaluate the significance of a cross-validated score with permutations
model_selection.learning_curve(estimator, X, y): Learning curve
model_selection.validation_curve(estimator, …): Validation curve

Split Function: train_test_split

Function signature:

sklearn.model_selection.train_test_split(*arrays, **options)

Purpose: randomly splits arrays or matrices into train and test subsets. The return value is a list whose length is twice the number of input arrays (each input is split into a training part and a test part). If the input is sparse, the output will be of type scipy.sparse.csr_matrix; otherwise the output type matches the input type.

Parameters:
*arrays : sequence of indexables with the same length / shape[0]. Allowed inputs are lists, ndarrays, scipy sparse matrices, or pandas DataFrames.
test_size : float, int, or None (default None). If float, it must be between 0.0 and 1.0 and represents the proportion of the dataset to put in the test set. If int, it is the absolute number of test samples. If None, it is inferred from train_size; if train_size is also None, test_size is set to 0.25.
train_size : float, int, or None (default None). If float, it must be between 0.0 and 1.0 and represents the proportion of the dataset to put in the training set. If int, it is the absolute number of training samples. If None, it is inferred from test_size.
random_state : int or RandomState. Pseudo-random number generator used for the random sampling.
stratify : array-like or None (default None). If not None, the data is split in a stratified fashion, using this as the class labels.

Example:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

# load the Boston housing data (506 samples, 13 features);
# load_boston follows the original article and has been removed from recent scikit-learn releases
boston = load_boston()
dataSet = boston.data
labels = boston.target

print(dataSet.shape)
print(labels.shape)

# hold out 30% of the samples as the test set;
# the returned list is [train_data, test_data, train_labels, test_labels]
splited = train_test_split(dataSet, labels, test_size=0.3)
print("elements in splited:\n", len(splited))
print("\n")

print("dataSet split into:")
print(splited[0].shape)
print(splited[1].shape)

print("labels split into:")
print(splited[2].shape)
print(splited[3].shape)

Results:

From the output, the shapes produced by the split can be read off directly.
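
For reference, assuming the Boston housing data (506 samples, 13 features) and test_size=0.3, the printed shapes should look roughly like this:

(506, 13)
(506,)
elements in splited:
 4
dataSet split into:
(354, 13)
(152, 13)
labels split into:
(354,)
(152,)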

Model Scoring

Ⅰ. sklearn.model_selection.cross_val_score

sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

Evaluates a model by cross-validation and returns the cross-validated scores as an array of shape (len(list(cv)),).

Parameters:
estimator : an object implementing "fit", used to fit the data; in practice, the classifier or regressor being evaluated.
X : array, the data to fit.
y : array-like, optional (default None). The labels corresponding to X.
groups : array-like with shape (n_samples,), optional. Group labels for the samples, used while splitting the dataset into train/test sets.
scoring : string or callable, optional (default None).
cv : int, cross-validation generator, or iterable, optional. Determines the cross-validation splitting strategy. Possible inputs are: None, to use the default 3-fold cross-validation; an integer, to specify the number of folds; or an object to be used as a cross-validation generator.
n_jobs : int, optional. The number of CPUs used for the computation; -1 means use all CPUs.
verbose : integer, optional. The verbosity level.
fit_params : dict, optional. Parameters to pass to the fit method of the estimator.
pre_dispatch : int or string, optional. Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned (use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs); an int, giving the exact number of total jobs that are spawned; or a string, giving an expression as a function of n_jobs, as in '2*n_jobs'.

This function is very commonly used for model selection. Here, the built-in Boston housing dataset and a Ridge regression model are used to give a simple example of how to use it.

Example 1:

import numpy as np

Here Ridge regression with alpha=1.0 is chosen, and the loss is computed with 10-fold cross-validation, so a 10-element array is returned, where each entry is the loss obtained when the corresponding fold of the original dataset serves as the validation set.
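
A minimal sketch of this example (Ridge with alpha=1.0 and cv=10; the choice of scoring="neg_mean_squared_error" to obtain a loss-like value is an assumption) could look like this:

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

boston = load_boston()

# Ridge regression with alpha=1.0, evaluated with 10-fold cross-validation
ridge = Ridge(alpha=1.0)
scores = cross_val_score(ridge, boston.data, boston.target,
                         scoring="neg_mean_squared_error", cv=10)

print(scores)          # one (negated) MSE per fold, shape (10,)
print(scores.mean())   # the mean is usually reported as the final loss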

Results:

In practice, the mean of these per-fold losses is taken as the final loss over the whole dataset.

Here is another example, looking at how the choice of the Ridge regularization parameter affects the result.

Example 2:

import numpy as np
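
A minimal sketch of this comparison (the alpha grid below is an illustrative assumption) could be:

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

boston = load_boston()

# compare several regularization strengths by their mean cross-validated loss
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_val_score(Ridge(alpha=alpha), boston.data, boston.target,
                             scoring="neg_mean_squared_error", cv=10)
    print("alpha=%s, mean loss: %.3f" % (alpha, -scores.mean()))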

Going one step further, a random forest can be added to compare against Ridge regression.

import numpy as np
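
A minimal sketch of such a comparison (the RandomForestRegressor settings are an assumption) might be:

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

boston = load_boston()

# evaluate both models with the same 10-fold cross-validation and loss
models = [("ridge", Ridge(alpha=1.0)),
          ("random forest", RandomForestRegressor(n_estimators=100, random_state=0))]
for name, model in models:
    scores = cross_val_score(model, boston.data, boston.target,
                             scoring="neg_mean_squared_error", cv=10)
    print("%s, mean loss: %.3f" % (name, -scores.mean()))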

Results:

