Python Basics 6: Cross-Validation

Source: 企鹅号 (Penguin Account) - 319小课堂

Numbers record the past; analysis looks ahead to the future.

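Preface

Basic idea

Split the original data into groups: one part serves as the training set used to fit the model, and the other serves as the test set used to evaluate it.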
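As a minimal sketch of this idea, assuming NumPy only, the data can be shuffled once and cut into two parts. The helper `holdout_split` is a hypothetical name introduced here for illustration, not a library function, and the toy arrays stand in for real data.

```python
import numpy as np

def holdout_split(X, y, test_size=0.15, seed=1):
    """Hypothetical helper: shuffle the sample indices, then split once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # random order of sample indices
    n_test = int(round(len(X) * test_size))  # number of samples held out for testing
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 20 samples with 3 features each
X = np.arange(60).reshape(20, 3)
y = np.arange(20) % 2
X_train, X_test, y_train, y_test = holdout_split(X, y)
print(X_train.shape, X_test.shape)
```

Every sample lands in exactly one of the two parts, so the model is never scored on data it was trained on.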

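Advantages

Cross-validation is used to assess a model's predictive performance, in particular how a trained model behaves on new data; this makes overfitting easier to detect and so helps limit it to some extent.

It also extracts as much useful information as possible from a limited amount of data.

k-fold cross-validation

k-fold cross-validation reduces variance by averaging the results obtained from k different splits, so the performance estimate is less sensitive to how the data happen to be partitioned.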

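Steps

(1) Randomly split the original data into k parts, sampling without replacement.

(2) In each round, pick one part as the test set and use the remaining k-1 parts as the training set for fitting the model.

(3) Repeat step 2 k times, so that every part serves as the test set exactly once and as part of the training set in all other rounds. Each round trains a model on its training set, tests it on the corresponding test set, and records the evaluation metric.

(4) Average the k test results to estimate the model's accuracy; this average is the model's performance metric under the current k-fold cross-validation.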
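The four steps can be sketched from scratch with NumPy. This is an illustration under stated assumptions rather than the article's own code: `kfold_indices` and `nearest_centroid_accuracy` are hypothetical helpers, the nearest-centroid rule is a stand-in for a real model, and the data are synthetic.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Hypothetical helper: yield (train_idx, test_idx) for each of k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)   # step 1: shuffle without replacement
    folds = np.array_split(idx, k)     # ... and cut the indices into k parts
    for i in range(k):                 # steps 2-3: each part is the test set once
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Toy stand-in model: predict the class whose training mean is closest."""
    classes = np.unique(y_tr)
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[np.argmin(dists, axis=1)]
    return np.mean(pred == y_te)

# Synthetic two-class data (assumption: two well-separated Gaussian blobs)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

scores = [nearest_centroid_accuracy(X[tr], y[tr], X[te], y[te])
          for tr, te in kfold_indices(len(X), k=5)]
cv_mean = np.mean(scores)              # step 4: average the k fold scores
print(scores, cv_mean)
```

Because the folds are built from one permutation of the indices, each sample is used for testing exactly once.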

6.1 Cross-validation

in [1]:

import pandas as pd
import numpy as np

df = pd.read_csv('wdbc.csv', header=None)
df.head()

out [1]:

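The first two columns store each sample's unique ID and its diagnosis (M for malignant, B for benign). Columns 2 to 31 (counting from 0) hold 30 real-valued features extracted from images of cell nuclei; they can be used to build a model that predicts whether a tumor is benign or malignant.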

in [2]:

from sklearn.preprocessing import LabelEncoder, StandardScaler
# train_test_split lives in sklearn.model_selection (scikit-learn 0.18 and later)
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = df.loc[:, 2:].values  # the 30 feature columns
Y = df.loc[:, 1].values   # the diagnosis labels ('M' or 'B')
le = LabelEncoder()       # converts the string labels to integers
Y = le.fit_transform(Y)

# Hold out 15% of the samples as the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, random_state=1)

# Standardize, project onto 2 principal components, then fit logistic regression
pipe_lr = Pipeline([('scl', StandardScaler()), ('pca', PCA(n_components=2)), ('clf', LogisticRegression(random_state=1))])
pipe_lr.fit(X_train, Y_train)
print('Test Accuracy:%.3f' % pipe_lr.score(X_test, Y_test))

out [2]:

Test Accuracy:0.942
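One reason to chain the scaler and the classifier in a `Pipeline` is that `fit` learns the scaling parameters from the training data only, and `score` then reuses those parameters on the test data. The following is a minimal NumPy sketch of that behaviour for the standardization step alone; the arrays are randomly generated toy data, not the wdbc data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

# "fit": estimate mean and standard deviation from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# "transform": apply the same training statistics to both sets
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

print(X_train_std.mean(axis=0))  # zero up to rounding error, by construction
print(X_test_std.mean(axis=0))   # near zero, but generally not exactly zero
```

Computing `mu` and `sigma` on the test set as well would leak information about the test data into the evaluation.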

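6.2 k-fold cross-validation

k-fold cross-validation: randomly split the training data into k folds without replacement; k-1 folds are used to train the model and the remaining fold is used for testing. Repeating this process k times yields k models and k estimates of model performance.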
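The code in this section uses the stratified variant, which additionally keeps the class proportions roughly equal across folds. A simplified sketch of the idea follows. The helper `stratified_fold_ids` is hypothetical, and scikit-learn's `StratifiedKFold` uses its own allocation scheme, so this is an illustration of the principle rather than a reimplementation.

```python
import numpy as np

def stratified_fold_ids(y, k, seed=1):
    """Hypothetical helper: assign each sample a fold id in 0..k-1, class by class."""
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == c))  # shuffled indices of class c
        fold_ids[members] = np.arange(len(members)) % k    # deal them out round-robin
    return fold_ids

# Imbalanced toy labels: 60 samples of class 0 and 40 of class 1
y = np.array([0] * 60 + [1] * 40)
fold_ids = stratified_fold_ids(y, k=10)

for f in range(10):
    test_mask = fold_ids == f
    # class counts in the test fold and in the matching training folds
    print(f, np.bincount(y[test_mask]), np.bincount(y[~test_mask]))
```

With these toy labels every test fold receives 6 samples of class 0 and 4 of class 1, matching the 60/40 ratio of the full set.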

in [3]:

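from sklearn.model_selection import StratifiedKFold

# Stratified 10-fold split: each fold keeps roughly the class ratio of Y_train.
# random_state is omitted because it only has an effect when shuffle=True.
kfold = StratifiedKFold(n_splits=10)
scores = []
for k, (train, test) in enumerate(kfold.split(X_train, Y_train)):
    pipe_lr.fit(X_train[train], Y_train[train])           # train on 9 folds
    score = pipe_lr.score(X_train[test], Y_train[test])   # evaluate on the held-out fold
    scores.append(score)
    print('Fold:%s,Class dist.:%s,Acc:%.6f' % (k + 1, np.bincount(Y_train[train]), score))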

Out [3]:


in [4]:

print('CV accuracy:%.3f+/-%.3f'%(np.mean(scores),np.std(scores)))

out [4]:

CV accuracy:0.953+/-0.016
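In formula form, writing $a_i$ for the accuracy on fold $i$ (a symbol introduced here for illustration), the two printed numbers are the mean and the population standard deviation, which is what `np.mean` and `np.std` compute with their default arguments:

```latex
\bar{a} = \frac{1}{k}\sum_{i=1}^{k} a_i,
\qquad
s = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(a_i - \bar{a}\right)^{2}}
```

Here $k = 10$, and the output above corresponds to $\bar{a} \approx 0.953$ and $s \approx 0.016$.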

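`np.bincount`, used in the loop above to print the class distribution of each training fold, counts how often each non-negative integer occurs in an array. The next cells illustrate it.

in [5]: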

np.arange(5)

out [5]:

array([0, 1, 2, 3, 4])

in [6]:

np.bincount(np.arange(5))

out[6]:

array([1, 1, 1, 1, 1], dtype=int64)

in [7]:

np.bincount(np.array([0, 1, 1, 3, 2, 1, 7]))

out [7]:

array([1, 3, 1, 1, 0, 0, 0, 1], dtype=int64)

in [8]:

w = np.array([0.3, 0.5, 0.2, 0.7, 1, -0.6])  # weights
x = np.array([0, 1, 1, 2, 2, 2])
np.bincount(x, weights=w)

out [8]:

array([ 0.3, 0.7, 1.1])
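To make explicit what these calls compute, here is an illustrative re-implementation. `bincount_manual` is a hypothetical name used only for this sketch; in practice `np.bincount` itself should be used.

```python
import numpy as np

def bincount_manual(x, weights=None):
    """Illustrative re-implementation of np.bincount for non-negative integers."""
    x = np.asarray(x)
    if weights is None:
        weights = np.ones(len(x), dtype=int)  # unweighted: every item counts once
    weights = np.asarray(weights)
    out = np.zeros(x.max() + 1, dtype=weights.dtype)  # one bin per value 0..max(x)
    for value, w in zip(x, weights):
        out[value] += w                       # add the item's weight to its bin
    return out

x = np.array([0, 1, 1, 2, 2, 2])
w = np.array([0.3, 0.5, 0.2, 0.7, 1, -0.6])
print(bincount_manual(x))
print(np.allclose(bincount_manual(x, w), np.bincount(x, weights=w)))
```

Bin 2 of the weighted result, for example, is 0.7 + 1 - 0.6 = 1.1, the sum of the weights whose matching entry in `x` equals 2.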

in [9]:

from sklearn.model_selection import cross_val_score

# cross_val_score runs the 10-fold loop internally and returns one score per fold
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=Y_train, cv=10, n_jobs=1)
print('CV accuracy scores:%s' % scores)

out [9]:

in [10]:

print('CV accuracy scores:%.3f +/- %.3f'%(np.mean(scores),np.std(scores)))

out [10]:

CV accuracy scores:0.953 +/- 0.016

in [11]:

import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

pipe_lr = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(random_state=1))])

# Evaluate the pipeline with 10-fold CV at 10 training-set sizes (10% to 100%)
train_sizes, train_scores, test_scores = learning_curve(estimator=pipe_lr, X=X_train, y=Y_train, train_sizes=np.linspace(0.1, 1, 10), cv=10, n_jobs=1)

train_mean = np.mean(train_scores, axis=1)  # axis=1: mean of each row
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# color: line color; marker: marker shape; markersize: marker size; label: legend label
plt.plot(train_sizes, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')
# fill_between adds a band of one standard deviation around the mean accuracy to show the variance of the estimate
plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', linestyle='--', marker='s', markersize=5, label='validation accuracy')
plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')

plt.grid()
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc='lower left')
plt.ylim([0.8, 1.0])
plt.show()

out [11]:
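`learning_curve` returns its score arrays with one row per training-set size and one column per cross-validation fold, which is why the statistics in the cell above are taken along `axis=1`. A small sketch with made-up scores shows the shapes involved; the numbers are invented purely for illustration.

```python
import numpy as np

# Made-up scores: 3 training sizes (rows) x 4 folds (columns)
fake_scores = np.array([
    [0.90, 0.92, 0.88, 0.90],
    [0.94, 0.95, 0.93, 0.94],
    [0.96, 0.97, 0.95, 0.96],
])

mean_per_size = np.mean(fake_scores, axis=1)  # one mean per training size
std_per_size = np.std(fake_scores, axis=1)    # spread across the folds

print(mean_per_size.shape, std_per_size.shape)
print(mean_per_size)
```

Each plotted point is therefore an average over folds, and the shaded band is the fold-to-fold standard deviation at that training size.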

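About the cover

Under a probability model, the p-value is the probability, computed on the assumption that the null hypothesis is true, that a summary statistic (such as the difference between two sample means) is equal to or more extreme than the value actually observed. If the p-value is smaller than the chosen significance level (0.05 or 0.01), the null hypothesis is rejected. This does not, however, directly prove that the alternative hypothesis is correct. The p-value is itself a random variable (for a continuous test statistic it is uniformly distributed on [0, 1] when the null hypothesis holds), so in practice it carries uncertainty from the sample and other factors, and conclusions drawn from it can be controversial.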
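One way to make this definition concrete is a permutation test for a difference in means, sketched below. This is an illustration on synthetic data: `permutation_p_value` is a hypothetical helper, and the one-sided test and the add-one correction are choices made for this sketch.

```python
import numpy as np

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Hypothetical helper: one-sided p-value for mean(a) - mean(b) via label shuffling."""
    rng = np.random.default_rng(seed)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)  # null hypothesis: group labels are exchangeable
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if diff >= observed:                # at least as extreme as the observation
            count += 1
    return (count + 1) / (n_perm + 1)       # add-one correction avoids a p-value of exactly 0

# Synthetic groups whose true means differ by two standard deviations
rng = np.random.default_rng(1)
group_a = rng.normal(2.0, 1.0, 30)
group_b = rng.normal(0.0, 1.0, 30)
p = permutation_p_value(group_a, group_b)
print(p)
```

The returned value estimates how often a difference at least as large as the observed one would arise if the group labels carried no information.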

Author: 贺涵镜

Editor: 刘皓昀

Next issue: Python Basics 7: Time Series

Published: 2018-06-23 08:01:12
Original link: https://kuaibao.qq.com/s/20180623G09ZR500?refer=cp_1026
