# R语言模拟：Cross Validation

## 前两篇在理论推导和模拟的基础上，对于误差分析中的偏差方差进行了分析。本文在前文的基础上，分析一种常用的估计预测误差进而可以参数优化的方法：交叉验证，并通过R语言进行模拟。

K-FOLD CV

ESL 7.10.2中提到应用CV的两种方法，比如对于一个包含多个自变量的分类模型，建模中包括两方面，一个是筛选出预测能力强的变量，一个是估计最佳的参数。因此有两种使用CV的方法（以下内容摘自ESL 7.10.2）

1

1.Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels

2.Using just this subset of predictors, build a multivariate classifier.

3.Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.

2

1. Divide the samples into K CV folds(group) at random.

2. For each fold k = 1,2,...,K

(a) Find a subset of "good" predictors that show fairly strong correlation with the class labels, using all of the samples except those in fold k.

(b) Using just this subset of predictors,build a multivariate classifier,using all of the samples except those in fold K.

(c) Use the classifier to predict rhe class labels for the samples in fold k.

（The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left out samples.）

Y = ifelse(X1+X2+...+X10>5,1,0)

1. CV估计的误差与实际预测误差的数值大小、趋势变化基本相同，并且真实预测误差基本都在CV1标准误以内。
2. p = 10附近误差最小，表明参数p应设定为10，这也与样本因变量定义相符。

```setwd('xxxx')
library(leaps)
library(DAAG)
library(caret)
lm.BestSubSet<- function(trainset,k){
lm.sub <- regsubsets(V21~.,trainset,nbest =1,nvmax = 20)
summary(lm.sub)
coef_lm <- coef(lm.sub,k)
strings_coef_lm <- coef_lm
x <- paste(names(coef_lm)[2:length(coef_lm)], collapse ='+')
formulas <- as.formula(paste('V21~',x,collapse=''))
return(formulas)
}

# ====================   get error ===============================
getError <- function(k,num,modeltype,seeds,n_test){
set.seed(seeds)
testset <- as.data.frame(matrix(runif(n_test*21,0,1),n_test))

Allfx_hat <- matrix(0,n_test,num)
Ally <- matrix(0,n_test,num)
Allfx <- matrix(0,n_test,num)

# 模拟 num次
for (i in 1:num){
trainset <- as.data.frame(matrix(runif(80*21,0,1),80))
fx_train <- ifelse(trainset[,1] +trainset[,2] +trainset[,3] +trainset[,4] +trainset[,5]+
trainset[,6] +trainset[,7] +trainset[,8] +trainset[,9] +trainset[,10]>5,1,0)
trainset[,21] <- fx_train
fx_test <- ifelse(testset[,1] +testset[,2] +testset[,3] +testset[,4] +testset[,5]+
testset[,6] +testset[,7] +testset[,8] +testset[,9] +testset[,10]>5,1,0)

testset[,21] <- fx_test

# best subset
lm.sub <- lm(formula = lm.BestSubSet(trainset,k),trainset)
probs <- predict(lm.sub,testset[,1:20], type = 'response')

Allfx_hat[,i] <- probs
Ally[,i] <- testset[,21]
Allfx[,i] <- fx_test

}
# 计算方差、偏差等

# irreducible <- sigma^2

irreducible  <- mean(apply( Allfx - Ally  ,1,var))
SquareBais  <- mean(apply((Allfx_hat - Allfx)^2,1,mean))
Variance <- mean(apply(Allfx_hat,1,var))

# 回归或分类两种情况
if (modeltype == 'reg'){
PredictError <- irreducible + SquareBais + Variance
}else{
PredictError <- mean(ifelse(Allfx_hat>=0.5,1,0)!=Allfx)
}
result <- data.frame(k,irreducible,SquareBais,Variance,PredictError)
return(result)
}

# -------------------- classification -------------------
modeltype <- 'classification'
num <- 100
n_test <- 1000
seeds <- 4

all_p <- seq(2,20,1)
result <- getError(1,num,modeltype,seeds,n_test)
for (i in all_p){
result <- rbind(result,getError(i,num,modeltype,seeds,n_test))
}

# ==================== CV  =========================

fun_cv <- function(trainset,kfold=10,seed_num =3,p){
set.seed(seed_num)
folds<-createFolds(y=trainset\$V21,kfold)
misrate <- NULL

for (i in(1:kfold)){

train_cv<-trainset[-folds[[i]],]
test_cv<-trainset[folds[[i]],]

model <- lm(formula = lm.BestSubSet(train_cv,p),train_cv)
result  <- predict(model,type='response',test_cv)

# result
misrate[i] <- mean( ifelse(result>=0.5,1,0) != test_cv\$V21)
}
sderror <- sd(misrate)/(length(misrate)^0.5)
misrate <- mean(misrate)
result <- data.frame(p,misrate,sderror)
return(result)
}

plot_error <- function(x, y, sd, len = 1, col = "black") {
len <- len * 0.05
arrows(x0 = x, y0 = y, x1 = x, y1 = y - sd, col = col, angle = 90, length = len)
arrows(x0 = x, y0 = y, x1 = x, y1 = y + sd, col = col, angle = 90, length = len)
}

# ================================   draw ==============================
# seed =   9,10,   92,65，114,  10912
seed_num =  9
trainset <- as.data.frame(matrix(runif(80*21,0,1),80))
trainset[,21] <- ifelse(trainset[,1] +trainset[,2] +trainset[,3] +trainset[,4] +trainset[,5]+
trainset[,6] +trainset[,7] +trainset[,8] +trainset[,9] +trainset[,10]>5,1,0)

resultcv <- fun_cv(trainset,kfold = 10,seed_num,p = 1)
for (p in 2:20){
resultcv <- rbind(resultcv,fun_cv(trainset,kfold = 10,seed_num,p))
}

png(file = "Cross Validation_large_testset.png")

plot(result\$k,result\$PredictError,type = 'o',col = 'red',
xlim = c(0,20),ylim = c(0,0.6),xlab = '', ylab ='', lwd = 2)
par(new = T)

plot(resultcv\$p,resultcv\$misrate,type='o',lwd=2,col='blue',ylim = c(0,0.6),xlim = c(0,20),
xlab = 'Subset Size p', ylab = 'Misclassification Error')
plot_error(resultcv\$p,resultcv\$misrate,resultcv\$sderror,col = 'blue',len = 1)
dev.off()```

Ruppert D. The Elements of Statistical Learning: Data Mining, Inference, and Prediction[J]. Journal of the Royal Statistical Society, 2010, 99(466):567-567.

59 篇文章33 人订阅

0 条评论