# Machine Learning in R (Classification Algorithms): Naive Bayes

## Bayesian Statistics Fundamentals

P(B|A) = P(AB) / P(A)

• P(A) is the prior (or marginal) probability of A. It is called "prior" because it does not take any information about B into account.
• P(A|B) is the conditional probability of A given B (plainly: B happens first, then A); because it is derived from the observed value of B, it is also called the posterior probability of A.
• P(B|A) is the conditional probability of B given A (plainly: A happens first, then B); because it is derived from the observed value of A, it is also called the posterior probability of B.
• P(B) is the prior (or marginal) probability of B, and also serves as a normalizing constant.
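As a quick sanity check, these definitions can be verified numerically; the probabilities below are made up purely for illustration:

```r
# Made-up probabilities for a numeric check (illustrative only).
p_A  <- 0.4    # P(A), marginal probability of A
p_B  <- 0.5    # P(B), marginal probability of B
p_AB <- 0.1    # P(AB), joint probability of A and B

p_B_given_A <- p_AB / p_A    # P(B|A) = P(AB)/P(A) = 0.25
p_A_given_B <- p_AB / p_B    # P(A|B) = P(AB)/P(B) = 0.2

# Bayes' theorem relates the two conditionals through the marginals:
stopifnot(all.equal(p_B_given_A, p_A_given_B * p_B / p_A))
```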

## Naive Bayes

Naive Bayes assumes the features b1, …, bn are conditionally independent given the class, so the class-conditional likelihood factorizes as:

P(B|A) = P(b1|A) · P(b2|A) · … · P(bn|A)

Let's put the formula to work on the classic play-tennis data set:

```r
data <- read.csv("D:/R/data/playing tennis.csv")
data <- data[, -1]   # drop the date column, which has no value as a predictor

# class priors P(Yes) and P(No)
prior.yes <- sum(data[, 5] == "Yes") / length(data[, 5])
prior.no  <- sum(data[, 5] == "No")  / length(data[, 5])

bayespre <- function(condition) {
  post.yes <-
    sum((data[, 1] == condition[1]) & (data[, 5] == "Yes")) / sum(data[, 5] == "Yes") *
    sum((data[, 2] == condition[2]) & (data[, 5] == "Yes")) / sum(data[, 5] == "Yes") *
    sum((data[, 3] == condition[3]) & (data[, 5] == "Yes")) / sum(data[, 5] == "Yes") *
    sum((data[, 4] == condition[4]) & (data[, 5] == "Yes")) / sum(data[, 5] == "Yes") *
    prior.yes
  post.no <-
    sum((data[, 1] == condition[1]) & (data[, 5] == "No")) / sum(data[, 5] == "No") *
    sum((data[, 2] == condition[2]) & (data[, 5] == "No")) / sum(data[, 5] == "No") *
    sum((data[, 3] == condition[3]) & (data[, 5] == "No")) / sum(data[, 5] == "No") *
    sum((data[, 4] == condition[4]) & (data[, 5] == "No")) / sum(data[, 5] == "No") *
    prior.no
  return(list(prob.yes = post.yes,
              prob.no  = post.no,
              prediction = ifelse(post.yes >= post.no, "Yes", "No")))
}
```

Predictions for three example days:

```r
bayespre(c("Rain", "Hot", "High", "Strong"))
bayespre(c("Sunny", "Mild", "Normal", "Weak"))
bayespre(c("Overcast", "Mild", "Normal", "Weak"))
```

```
> bayespre(c("Rain","Hot","High","Strong"))
$prob.yes
[1] 0.005291005
$prob.no
[1] 0.02742857
$prediction
[1] "No"

> bayespre(c("Sunny","Mild","Normal","Weak"))
$prob.yes
[1] 0.02821869
$prob.no
[1] 0.006857143
$prediction
[1] "Yes"

> bayespre(c("Overcast","Mild","Normal","Weak"))
$prob.yes
[1] 0.05643739
$prob.no
[1] 0
$prediction
[1] "Yes"
```

The same construction extends to more than two classes. Below is the output of an analogous bayespre function built on an animals data set with five classes (mammals, birds, fishes, amphibians, reptiles):

```
> bayespre(animals, c("no","yes","no","sometimes","yes"))
$prob.mammals
[1] 0
$prob.amphibians
[1] 0.1
$prob.fishes
[1] 0
$prob.reptiles
[1] 0.0375
$prediction
[1] amphibians
Levels: amphibians birds fishes mammals reptiles

> bayespre(animals, c("no","yes","no","yes","no"))
$prob.mammals
[1] 0.0004997918
$prob.amphibians
[1] 0
$prob.fishes
[1] 0.06666667
$prob.reptiles
[1] 0
$prediction
[1] fishes
Levels: amphibians birds fishes mammals reptiles

> bayespre(animals, c("yes","no","no","yes","no"))
$prob.mammals
[1] 0.0179925
$prob.amphibians
[1] 0
$prob.fishes
[1] 0.01666667
$prob.reptiles
[1] 0
$prediction
[1] mammals
Levels: amphibians birds fishes mammals reptiles
```

Back to the play-tennis data: for an attribute value never seen in training (here "foggy"), every class-conditional factor is zero, so both posteriors vanish and the tie-breaking ifelse arbitrarily returns "Yes":

```
> bayespre(c("foggy","Hot","High","Strong"))
$prob.yes
[1] 0
$prob.no
[1] 0
$prediction
[1] "Yes"
```

This zero-frequency problem is what the m-estimate corrects:

P(xi|yj) = (nc + m·p) / (n + m)

Here n is the total number of training samples in class yj, nc is the number of those samples that take the value xi, m is a parameter called the equivalent sample size, and p is a user-specified prior. With no training data (n = 0), P(xi|yj) = p, so p can be viewed as the prior probability of observing value xi within class yj. The equivalent sample size governs the balance between this prior and the observed frequency nc/n, making the estimate more robust.
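The estimator translates into R directly; the function name below is ours, and the example values match the text-mining example later in the article:

```r
# m-estimate of P(xi | yj): (nc + m*p) / (n + m), as defined above.
m_estimate <- function(nc, n, m, p) (nc + m * p) / (n + m)

# nc = 3 of n = 8 class samples take the value xi; m = 7, p = 1/7:
m_estimate(nc = 3, n = 8, m = 7, p = 1/7)   # (3+1)/(8+7) = 4/15
# With no training data at all (n = nc = 0), the estimate falls back to p:
m_estimate(nc = 0, n = 0, m = 7, p = 1/7)   # 1/7
```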

## The Naive Bayes Implementation in R

The naiveBayes function in R's e1071 package provides a concrete implementation of naive Bayes; the laplace argument controls Laplace smoothing. Its usage is:

```r
## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)

## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)
```

```r
library(e1071)
data(Titanic)
m <- naiveBayes(Survived ~ ., data = Titanic)
m
```
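A fitted model is then applied with predict(); a minimal sketch on the same Titanic data (following the pattern of the e1071 documentation):

```r
library(e1071)   # provides naiveBayes() and its predict() method

data(Titanic)
m <- naiveBayes(Survived ~ ., data = Titanic)

# Class predictions for every attribute combination in the table:
predict(m, as.data.frame(Titanic))
# Posterior class probabilities instead of labels:
predict(m, as.data.frame(Titanic), type = "raw")
```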

## Text-Processing Tools in R

The tm package's Corpus() constructs a corpus from a text source:

```r
Corpus(x,
       readerControl = list(reader = x$DefaultReader, language = "en"), ...)
```

tm_map() applies a transformation to every document in a corpus:

```r
tm_map(x, FUN, ..., useMeta = FALSE, lazy = FALSE)
```

Commonly used values of FUN include as.PlainTextDocument (convert XML to plain text), stripWhitespace (collapse extra whitespace), tolower (convert to lower case), removeWords (remove stop words), and stemDocument (stemming).

The Dictionary() function is commonly used to focus on a given set of terms in text mining. When a dictionary is passed to DocumentTermMatrix(), the resulting matrix records, for each document, the frequencies of only the words in that dictionary. (An example appears later, so we will not dwell on it here.)

```r
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
```

• x: a character vector; each element is split separately.
• split: a character vector giving the split pattern, interpreted as a regular expression by default (fixed = FALSE); with fixed = TRUE the split string is matched literally.
• perl: whether to use Perl-compatible regular expressions.
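For example:

```r
# Default: split is treated as a regular expression.
strsplit("spc,doe;ewma", "[,;]")
# [[1]] "spc" "doe" "ewma"

# fixed = TRUE splits on the literal string "." (as a regex,
# "." would match every character).
strsplit("run.length", ".", fixed = TRUE)
# [[1]] "run" "length"
```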

## Naive Bayes for Text Mining

```
for each training document:
    for each class:
        if a term appears in the document:
            increment that term's count; increment the total term count
for each class:
    for each term:
        prob = term count / total term count
return prob
```
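The counting loop above can be sketched in plain R; the toy corpus and function name below are made up for illustration, and documents are assumed to be given as character vectors of key words:

```r
# Hypothetical toy corpus: each document is a vector of key words,
# with a parallel vector of class labels.
docs <- list(c("control chart", "run length"),
             c("control chart", "EWMA"),
             c("main effect", "quadratic effect"))
classes <- c("spc", "spc", "doe")

# Relative frequency of a term within one class: term count / total term count.
term_prob <- function(term, class) {
  in_class <- docs[classes == class]
  nc <- sum(vapply(in_class, function(d) sum(d == term), numeric(1)))
  n  <- sum(lengths(in_class))
  nc / n   # unsmoothed; the m-estimate above fixes zero counts
}

term_prob("control chart", "spc")   # 2 of the 4 spc key words -> 0.5
```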

| docId | Key words | class |
|-------|-----------|-------|
| 1 | "adaptive weighting" "run length" "control chart" | spc |
| 2 | "run length" "control chart" | spc |
| 3 | "control chart" "EWMA" "run length" | spc |
| 4 | "D-Efficiency" "Main Effect" "Quadratic Effect" | doe |

Applying the m-estimate to this table (n = 8 key words in the spc documents, n = 3 in the doe document, m = 7 distinct terms, p = 1/7):

P("control chart" | spc) = (3+1)/(8+7) = 4/15

P("main effect" | spc) = (0+1)/(8+7) = 1/15

P("control chart" | doe) = (0+1)/(3+7) = 0.1

P(spc | d) = 4/15 × 4/15 × 1/15 × 8/11 ≈ 0.003447811

P(doe | d) = 0.1 × 0.1 × 0.2 × 0.1 × 3/11 ≈ 5.454545e-05
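These figures can be reproduced in a few lines of R; the variable names are ours, and the class prior 8/11 is word-count weighted (8 spc key words out of 11 total), as in the text:

```r
m <- 7; p <- 1 / m          # 7 distinct key words, prior p = 1/m
m_est <- function(nc, n) (nc + m * p) / (n + m)

p_cc_spc <- m_est(3, 8)     # P("control chart" | spc) = 4/15
p_rl_spc <- m_est(3, 8)     # P("run length"    | spc) = 4/15
p_me_spc <- m_est(0, 8)     # P("main effect"   | spc) = 1/15
p_cc_doe <- m_est(0, 3)     # P("control chart" | doe) = 0.1

p_spc_d <- p_cc_spc * p_rl_spc * p_me_spc * 8 / 11
p_spc_d                     # ~ 0.003447811
```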

## Email Classification with Naive Bayes

The R code follows.

Step 1: build the word bags (corpora):

```r
library(tm)
txt1 <- "D:/R/data/email/ham"
txtham <- Corpus(DirSource(txt1))   # assumed: build the ham corpus (original line lost)
txtham <- tm_map(txtham, stripWhitespace)
txtham <- tm_map(txtham, tolower)
txtham <- tm_map(txtham, removeWords, stopwords("english"))
txtham <- tm_map(txtham, stemDocument)

txt2 <- "D:/R/data/email/spam"
txtspam <- Corpus(DirSource(txt2))  # assumed: build the spam corpus (original line lost)
txtspam <- tm_map(txtspam, stripWhitespace)
txtspam <- tm_map(txtspam, tolower)
txtspam <- tm_map(txtspam, removeWords, stopwords("english"))
txtspam <- tm_map(txtspam, stemDocument)
```

Step 2: count the vocabulary (number of distinct terms and total word counts):

```r
dtm1 <- DocumentTermMatrix(txtham)
n1 <- length(findFreqTerms(dtm1, 1))   # distinct ham terms
dtm2 <- DocumentTermMatrix(txtspam)
n2 <- length(findFreqTerms(dtm2, 1))   # distinct spam terms

setwd("D:/R/data/email/spam")
name <- list.files(txt2)
data1 <- paste("spam", 1:length(name))
lenspam <- 0
for (i in 1:length(name)) {
  assign(data1[i], scan(name[i], "character"))
  lenspam <- lenspam + length(get(data1[i]))
}

setwd("D:/R/data/email/ham")
names <- list.files(txt1)
data <- paste("ham", 1:length(names))
lenham <- 0
for (i in 1:length(names)) {
  assign(data[i], scan(names[i], "character"))
  lenham <- lenham + length(get(data[i]))
}
```

Step 3: build the naive Bayes model (using the m-estimate with p = 1/m, where m is the vocabulary size):

```r
prob <- function(char, corp, len, n) {
  d <- Dictionary(char)
  re <- DocumentTermMatrix(corp, list(dictionary = d))
  # m-estimate: (count + 1) / (n + len), i.e. m*p = 1 with p = 1/n
  prob <- (sum(as.matrix(re)[, 1]) + 1) / (n + len)
  return(prob)
}

testingNB <- function(sentences) {
  pro1 <- 0.5   # prior for ham
  pro2 <- 0.5   # prior for spam
  for (i in 1:length(sentences)) {
    pro1 <- pro1 * prob(sentences[i], txtham, lenham, n1)
  }
  for (i in 1:length(sentences)) {
    pro2 <- pro2 * prob(sentences[i], txtspam, lenspam, n2)
  }
  return(list(prob.ham  = pro1,
              prob.spam = pro2,
              prediction = ifelse(pro1 >= pro2 / 10, "ham", "spam")))
}
```

Step 4: testing (using the 4 emails in the test folder; only ham2.txt and spam1.txt are shown):

```r
# read the document, tokenize, and stem
email <- scan("D:/R/data/email/test/ham2.txt", "character")
sentences <- unlist(strsplit(email, ",|\\?|\\;|\\!"))   # tokenize
library(Snowball)                                       # stemming
a <- tolower(SnowballStemmer(sentences))                # stem and lower-case
# test
testingNB(a)
```

For ham2.txt:

```
$prob.ham
[1] 3.537766e-51
$prob.spam
[1] 4.464304e-51
$prediction
[1] "ham"
```

For spam1.txt (after running the same preprocessing on that file):

```
$prob.ham
[1] 5.181995e-95
$prob.spam
[1] 1.630172e-84
$prediction
[1] "spam"
```
