# Naive Bayes: A Classification Algorithm Based on Probability Theory

## Bayes

Source: Wikipedia. In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if cancer is related to age, then, using Bayes' theorem, a person's age can be used to more accurately assess the probability that they have cancer, compared to the assessment of the probability of cancer made without knowledge of the person's age.

## How a Probabilistic Classifier Works

### Naive Bayes

Under the "naive" assumption that the attributes $a_1, \dots, a_n$ are conditionally independent given the class $c_i$, the posterior probability is proportional to

$$P(c_i \mid a_1, \dots, a_n) = \frac{P(a_1, \dots, a_n \mid c_i)\,P(c_i)}{P(a_1, \dots, a_n)} \propto \prod_{j=1}^{n} P(a_j \mid c_i)\,P(c_i)$$
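As a quick numeric illustration of Bayes' rule, we can plug in the cancer/age example from the quote above. All of the probabilities here are invented for the sake of the sketch:

```python
# Hypothetical numbers illustrating Bayes' rule with the cancer/age example
# quoted above (every probability below is made up for illustration).
p_cancer = 0.01           # prior P(cancer)
p_age_given_cancer = 0.5  # likelihood P(age > 60 | cancer)
p_age = 0.20              # evidence P(age > 60)

# Bayes' rule: P(cancer | age > 60) = P(age > 60 | cancer) * P(cancer) / P(age > 60)
p_cancer_given_age = p_age_given_cancer * p_cancer / p_age
print(p_cancer_given_age)  # ~0.025: knowing the age raises the estimate from the 1% prior
```

Conditioning on age moves the estimate from the 1% prior to about 2.5%, which is exactly the "more accurate assessment" the Wikipedia passage describes.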

## Writing a Simple Naive Bayes Classifier

### Preparing the Data

1. First we prepare some data. This time we use posts from a Dalmatian lovers' message board, each already labeled as abusive (1) or not abusive (0).

   ```python
   def loadDataSet():
       postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                      ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                      ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                      ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                      ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                      ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
       classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
       return postingList, classVec
   ```
2. Collect every word that appears and build a vocabulary from the set union of all documents.

   ```python
   def createVocabList(dataSet):
       vocabSet = set([])  # create empty set
       for document in dataSet:
           vocabSet = vocabSet | set(document)  # union of the two sets
       return list(vocabSet)
   ```
3. Using the vocabulary, convert a post into a vector: start with a zero vector as long as the vocabulary, then set the position of every word that appears in the post to 1.

   ```python
   def setOfWords2Vec(vocabList, inputSet):
       returnVec = [0] * len(vocabList)
       for word in inputSet:
           if word in vocabList:
               returnVec[vocabList.index(word)] = 1
           else:
               print("the word: %s is not in my Vocabulary!" % word)
       return returnVec
   ```
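To make the vectorization step concrete, here is a minimal self-contained sketch of the same logic on two of the sample posts (the lowercase names are my own; the real functions are the ones defined above):

```python
# Two of the sample posts from loadDataSet(), abbreviated.
posts = [['my', 'dog', 'has', 'flea', 'problems'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage']]

# createVocabList step: union of all words, as a list.
vocab = list(set(w for post in posts for w in post))

# setOfWords2Vec step: 0/1 vector marking which vocabulary words appear.
def set_of_words_to_vec(vocab_list, input_set):
    vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            vec[vocab_list.index(word)] = 1
    return vec

vec = set_of_words_to_vec(vocab, posts[1])
print(len(vocab), sum(vec))  # 10 vocabulary words; 5 of them occur in the second post
```

The vector has one slot per vocabulary word, so its length is fixed regardless of which post is being encoded.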

### Training the Algorithm

1. When multiplying many probabilities together, a single zero probability makes the whole product zero. To avoid this, we initialize the count vectors to ones and change the denominators to 2.0 (so a word's presence and absence each start at 0.5).
2. Because every probability is below 1, the factors are very small and their product becomes tiny, causing the program to underflow or return a wrong answer. We avoid this by taking the natural logarithm of the product: the graphs of f(x) and ln(f(x)) rise and fall together, so the ordering of the class scores is preserved.
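The two points above can be demonstrated directly. The sketch below shows the underflow problem and how the log transform sidesteps it (the 1e-5 factors are made-up values standing in for small conditional probabilities):

```python
from math import log

# Why log-probabilities: the product of many small factors underflows to 0.0,
# while the sum of their logs stays in a comfortably representable range.
probs = [1e-5] * 100  # 100 hypothetical conditional probabilities

product = 1.0
for p in probs:
    product *= p  # the true value is 1e-500, far below what a float can hold

log_sum = sum(log(p) for p in probs)  # about -1151.3, no underflow

print(product, log_sum)  # product has underflowed to 0.0
```

Since the log of a product is the sum of the logs, `trainNB0` below can store `log(p1Num/p1Denom)` once and the classifier only ever adds these values.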

```python
from numpy import ones, log

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)  # change to ones()
    p0Denom = 2.0; p1Denom = 2.0                    # change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)  # change to log()
    p0Vect = log(p0Num / p0Denom)  # change to log()
    return p0Vect, p1Vect, pAbusive
```

### Testing the Algorithm

```python
from numpy import log

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)  # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
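`classifyNB` compares sums of log-probabilities rather than products of probabilities. Because the logarithm is monotonically increasing, the two comparisons always agree; a tiny check with made-up factors:

```python
from math import log

# Three hypothetical conditional probabilities per class.
p1_factors = [0.4, 0.5, 0.9]
p0_factors = [0.3, 0.3, 0.8]

prod1 = 0.4 * 0.5 * 0.9  # raw product for class 1
prod0 = 0.3 * 0.3 * 0.8  # raw product for class 0

log1 = sum(log(p) for p in p1_factors)  # log-space score for class 1
log0 = sum(log(p) for p in p0_factors)  # log-space score for class 0

print((prod1 > prod0) == (log1 > log0))  # True: same ordering either way
```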

```python
from numpy import array

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
```
