# Machine Learning with R (Classification Algorithms): Logistic Regression

## 2. Logit or Probit?

Both link functions classify the iris data essentially the same way; the confusion matrices (rows: true species, columns: predictions) are:

```
probit.predictions
             versicolor  virginica
  versicolor         47          3
  virginica           3         47

logit.predictions
             versicolor  virginica
  versicolor         47          3
  virginica           3         47
```

## 5. The R Implementation of Generalized Linear Models

R provides the function glm() for fitting generalized linear models. Its call signature is:

```r
glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, contrasts = NULL, ...)
```

formula: the regression formula; used the same way as the formula argument of lm().

family: the distribution family, which determines the canonical link function of the model. glm() supports the Gaussian, exponential, Gamma, inverse Gaussian, Poisson, and binomial families. Logistic regression uses the binomial family, whose default link is logit; it can also be set to probit.

data: the data set.


```r
logit.fit <- glm(Species ~ Petal.Width + Petal.Length,
                 family = binomial(link = 'logit'),
                 data = iris[51:150, ])
logit.predictions <- ifelse(predict(logit.fit) > 0, 'virginica', 'versicolor')
table(iris[51:150, 5], logit.predictions)

probit.fit <- glm(Species ~ Petal.Width + Petal.Length,
                  family = quasibinomial(link = 'probit'),
                  data = iris[51:150, ])
probit.predictions <- ifelse(predict(probit.fit) > 0, 'virginica', 'versicolor')
table(iris[51:150, 5], probit.predictions)
```
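A note on the threshold used above: by default, predict() on a glm object returns values on the link (linear-predictor) scale, and for the logit link a linear predictor of 0 corresponds to a probability of 0.5. A minimal sketch of the equivalent classification on the probability scale, reusing logit.fit and logit.predictions from above:

```r
# predict(..., type = "response") returns fitted probabilities; for the
# logit link, thresholding them at 0.5 matches thresholding the linear
# predictor at 0, because the link is monotone and logit(0.5) = 0.
probs <- predict(logit.fit, type = "response")
preds <- ifelse(probs > 0.5, 'virginica', 'versicolor')
all(preds == logit.predictions)
```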

For multinomial responses, the mlogit package provides the mlogit() function, called as follows:

```r
mlogit(formula, data, subset, weights, na.action, start = NULL,
       alt.subset = NULL, reflevel = NULL,
       nests = NULL, un.nest.el = FALSE, unscaled = FALSE,
       heterosc = FALSE, rpar = NULL, probit = FALSE,
       R = 40, correlation = FALSE, halton = NULL,
       random.nb = NULL, panel = FALSE, estimate = TRUE,
       seed = 10, ...)
```

```r
mlogit.data(data, choice, shape = c("wide", "long"), varying = NULL,
            sep = ".", alt.var = NULL, chid.var = NULL, alt.levels = NULL,
            id.var = NULL, opposite = NULL, drop.index = FALSE,
            ranked = FALSE, ...)
```

formula: mlogit supports conditional logit, multinomial logit, and mixed logit models. For a multinomial logit, the model is written as response ~ 0 | covariates, e.g. mode ~ 0 | income.

data: a data set prepared with mlogit.data() so that its structure meets mlogit()'s requirements.

choice: the variable identifying which alternative was chosen.

shape: "wide" if each row is one observation; "long" if each row represents one alternative.

alt.var: for long-format data, the variable naming the alternatives.

(The screenshots of wide- and long-format example data are omitted here.)

Using the Fishing data as an example of how to use mlogit():
```r
library(mlogit)
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, shape = "wide", choice = "mode")
summary(mlogit(mode ~ 0 | income, data = Fish))
```
This output agrees with that of multinom() in the nnet package, although mlogit can fit a wider variety of logit models.

The MASS package provides polr() for ordered logit or ordered probit regression. Its usage:
```r
polr(formula, data, weights, start, ..., subset, na.action,
     contrasts = NULL, Hess = FALSE, model = TRUE,
     method = c("logistic", "probit", "cloglog", "cauchit"))
```

Argument notes:

formula: the regression formula; used the same way as the formula argument of lm().

data: the data set.

method: defaults to ordered logit ("logistic"); setting it to "probit" fits an ordered probit model.

Using the housing data set as an example:
```r
library(MASS)
house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
house.plr
summary(house.plr, digits = 3)
```
The output is intuitive and easy to interpret, so we omit it here.
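As a brief aside, predict() on a polr fit can return either the most probable category or the full matrix of category probabilities. A minimal sketch reusing the house.plr fit above:

```r
# Fitted category probabilities (one column per level of Sat) and the
# corresponding most probable class for the first few covariate patterns.
head(predict(house.plr, type = "probs"))
head(predict(house.plr, type = "class"))
```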

## The Handwritten-Digit Example Revisited

Because the features extracted from the handwritten digits easily yield a rank-deficient coefficient matrix, we use multinom() from the nnet package instead of mlogit(). Run the following code:
```r
library(nnet)

# Extract 25 summary features from one 32x32 digit matrix: selected
# column sums, row sums, the two diagonal sums, and the sixteen 8x8
# block sums.
extract.features <- function(m) {
  f <- numeric(25)
  f[1] <- sum(m[, 16]);  f[2] <- sum(m[, 8]);   f[3] <- sum(m[, 24])
  f[4] <- sum(m[16, ]);  f[5] <- sum(m[11, ]);  f[6] <- sum(m[21, ])
  f[7] <- sum(diag(m));  f[8] <- sum(diag(m[, 32:1]))
  f[9] <- sum(m[17:32, 17:32])
  blocks <- list(1:8, 9:16, 17:24, 25:32)
  k <- 10
  for (cols in blocks)
    for (rows in blocks) {
      f[k] <- sum(m[rows, cols])
      k <- k + 1
    }
  f
}

# Read every digit file in a directory and build the feature matrix.
read.digits <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  t(sapply(files, function(f)
    extract.features(as.matrix(read.fwf(f, widths = rep(1, 32))))))
}

# --- training set ---
feature <- read.digits("D:/R/data/digits/trainingDigits")
label   <- factor(rep(0:9, c(189, 198, 195, 199, 186, 187, 195, 201, 180, 204)))
data1   <- data.frame(feature, label)

# mlogit() cannot be used when the coefficient matrix is rank-deficient:
# data10 <- mlogit.data(data1, shape = "wide", choice = "label")
# m1 <- mlogit(label ~ 0 | X1 + X2 + ... + X25, data = data10)
m1 <- multinom(label ~ ., data = data1)
pred <- predict(m1, data1)
table(pred, label)
sum(diag(table(pred, label))) / length(label)          # training accuracy

# --- test set ---
feature   <- read.digits("D:/R/data/digits/testDigits")
labeltest <- factor(rep(0:9, c(87, 97, 92, 85, 114, 108, 87, 96, 91, 89)))
data2     <- data.frame(feature, labeltest)

pred1 <- predict(m1, data2)
table(pred1, labeltest)
sum(diag(table(pred1, labeltest))) / length(labeltest)  # test accuracy
```
The tidied output (training-set results on the left, test-set results on the right) was shown as an image and is omitted here.
Tip: the odds ratio is p/(1 − p). In a Bernoulli model, p is the probability that event A occurs and 1 − p the probability that it does not, so p/(1 − p) is the relative odds of occurrence versus non-occurrence. An odds ratio of 1 means the factor has no effect on whether A occurs; greater than 1 means A is more likely to occur than not; less than 1 means A is more likely not to occur.
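To connect this to the models above: for a fitted logit model, exponentiating a coefficient gives the odds ratio associated with a one-unit increase in that covariate. A minimal sketch reusing the earlier iris fit:

```r
# exp() of each logit coefficient is the multiplicative change in the
# odds p/(1 - p) per unit increase in the corresponding covariate.
logit.fit <- glm(Species ~ Petal.Width + Petal.Length,
                 family = binomial(link = 'logit'),
                 data = iris[51:150, ])
exp(coef(logit.fit))
```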
