# Naive Bayes

## The Naive Bayes Classifier

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} P(c \mid d)$$

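Here $\hat{c}$ is the class, among all classes $C=\{c_1, c_2, \ldots, c_m\}$, that maximizes the conditional probability $P(c \mid d)$. Applying Bayes' rule, the expression above becomes: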

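$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} P(c \mid d)=\underset{c \in C}{\operatorname{argmax}} \frac{P(d \mid c) P(c)}{P(d)}$$

Because $P(d)$ is the same for every class, it does not change the argmax and can be dropped:

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} P(c \mid d)=\underset{c \in C}{\operatorname{argmax}} P(d \mid c) P(c)$$

Representing the document $d$ as a set of features $f_1, f_2, \ldots, f_n$ gives:

$$\hat{c}=\underset{c \in C}{\operatorname{argmax}} \overbrace{P\left(f_{1}, f_{2}, \ldots, f_{n} \mid c\right)}^{\text{likelihood}} \overbrace{P(c)}^{\text{prior}}$$

The naive Bayes assumption is that the features are conditionally independent given the class, so the likelihood factorizes:

$$c_{NB}=\underset{c \in C}{\operatorname{argmax}} P(c) \prod_{f \in F} P(f \mid c)$$

Applied to the word positions of a document and computed in log space, which avoids floating-point underflow when many small probabilities are multiplied, this becomes:

$$c_{NB}=\underset{c \in C}{\operatorname{argmax}} \log P(c)+\sum_{i \in \text{positions}} \log P\left(w_{i} \mid c\right)$$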
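To see why the log form matters in practice, here is a minimal sketch. The likelihood values and the document length are invented for illustration and are not taken from the example in this post. Multiplying many small likelihoods underflows to zero in floating point, while summing their logarithms stays finite and keeps the ordering between classes.

```python
import math

# Hypothetical per-word likelihoods for two classes over a 400-word document.
# These numbers are made up purely for illustration.
probs_a = [1e-3] * 400
probs_b = [2e-3] * 400

# Direct products: both underflow to 0.0, so the classes cannot be compared.
prod_a = math.prod(probs_a)
prod_b = math.prod(probs_b)

# Log space: the sums remain finite and preserve the ordering.
log_a = sum(math.log(p) for p in probs_a)
log_b = sum(math.log(p) for p in probs_b)

print(prod_a, prod_b)   # both products have underflowed
print(log_a < log_b)    # class b still scores higher in log space
```

Since the logarithm is monotonic, the class that maximizes the log score is the same class that would maximize the product if it could be computed exactly.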

## Training the Naive Bayes Classifier

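The prior is estimated as the fraction of training documents that belong to class $c$, where $N_c$ is the number of documents in class $c$ and $N_{doc}$ is the total number of documents:

$$\hat{P}(c)=\frac{N_{c}}{N_{doc}}$$

The likelihood of word $w_i$ given class $c$ is its count in the documents of class $c$, divided by the total count of all vocabulary words $V$ in that class:

$$\hat{P}\left(w_{i} \mid c\right)=\frac{\operatorname{count}\left(w_{i}, c\right)}{\sum_{w \in V} \operatorname{count}(w, c)}$$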
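As a rough sketch of these two estimates, the snippet below counts documents and tokens with `collections.Counter`. The function name `train_mle` and the tiny corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_mle(docs, labels):
    """Unsmoothed maximum-likelihood estimates of P(c) and P(w | c)."""
    n_doc = len(docs)
    class_counts = Counter(labels)        # N_c for each class
    word_counts = defaultdict(Counter)    # count(w, c)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    prior = {c: n / n_doc for c, n in class_counts.items()}
    likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())      # all tokens seen with class c
        likelihood[c] = {w: k / total for w, k in counts.items()}
    return prior, likelihood


# Toy corpus, invented for illustration.
docs = [["good", "fun"], ["bad"], ["good"]]
labels = ["+", "-", "+"]
prior, likelihood = train_mle(docs, labels)
print(prior["+"], likelihood["+"]["good"])
```

In this sketch a word that never occurs with a class simply has no entry for that class, which corresponds to an estimated probability of zero.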

## Handling Unknown Words

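If a word in the vocabulary never occurs in the training documents of class $c$, its maximum-likelihood estimate is zero, which drives the whole product to zero. Add-one (Laplace) smoothing avoids this:

$$\hat{P}\left(w_{i} \mid c\right)=\frac{\operatorname{count}\left(w_{i}, c\right)+1}{\sum_{w \in V}\left(\operatorname{count}(w, c)+1\right)}=\frac{\operatorname{count}\left(w_{i}, c\right)+1}{\left(\sum_{w \in V} \operatorname{count}(w, c)\right)+|V|}$$

Words that are absent from the vocabulary altogether are ignored at prediction time.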
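A minimal sketch of the add-one estimate, assuming the word counts for one class are held in a plain dict. The helper name `laplace_likelihood` and the counts below are hypothetical.

```python
def laplace_likelihood(word, class_counts, vocab_size):
    """Add-one smoothed estimate of P(word | c) for a single class."""
    total = sum(class_counts.values())
    return (class_counts.get(word, 0) + 1) / (total + vocab_size)


# Invented counts for one class; "bad" and "dull" were never seen with it.
counts = {"good": 3, "fun": 1}
vocab = {"good", "fun", "bad", "dull"}

probs = {w: laplace_likelihood(w, counts, len(vocab)) for w in vocab}
print(probs["good"], probs["bad"])
print(sum(probs.values()))
```

Every vocabulary word now receives a nonzero probability, and the smoothed estimates over the vocabulary still sum to one.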

## A Worked Classification Example

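Training set, where `-` marks the negative class and `+` the positive class:

| Class | Document |
| --- | --- |
| `-` | just plain boring |
| `-` | entirely predictable and lacks energy |
| `-` | no surprises and very few laughs |
| `+` | very powerful |
| `+` | the most fun film of the summer |

Test document:

predictable with no fun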

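The training set contains 23 word tokens: 14 in the negative class and 9 in the positive class. Three words (and, very, the) each occur twice, so the vocabulary $V$ has 20 entries. With add-one smoothing, the likelihood of the word predictable given the negative class is:

$$P(\text{predictable} \mid -)=\frac{1+1}{14+20}=\frac{2}{34}$$

Similarly, for the word fun:

$$P(\text{fun} \mid -)=\frac{0+1}{14+20}=\frac{1}{34} \qquad P(\text{fun} \mid +)=\frac{1+1}{9+20}=\frac{2}{29}$$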
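The full calculation for the test document can be checked with exact fractions. The following is a sketch rather than a reference implementation: it recounts the five training sentences listed above, applies add-one smoothing, and skips the word "with" because it is not in the vocabulary. The variable names are chosen for this illustration only.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprises and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]
test_doc = "predictable with no fun".split()

vocab = {w for _, s in train for w in s.split()}
counts = {"-": Counter(), "+": Counter()}
n_docs = Counter()
for c, s in train:
    counts[c].update(s.split())
    n_docs[c] += 1

scores = {}
for c in counts:
    total = sum(counts[c].values())            # tokens in class c
    score = Fraction(n_docs[c], len(train))    # prior P(c)
    for w in test_doc:
        if w in vocab:                         # unknown words are skipped
            score *= Fraction(counts[c][w] + 1, total + len(vocab))
    scores[c] = score

print({c: float(s) for c, s in scores.items()})
print(max(scores, key=scores.get))
```

The negative class scores $\frac{3}{5} \cdot \frac{2}{34} \cdot \frac{2}{34} \cdot \frac{1}{34}$ and the positive class scores $\frac{2}{5} \cdot \frac{1}{29} \cdot \frac{1}{29} \cdot \frac{2}{29}$, so the test document is assigned to the negative class.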

## Code Implementation

    import numpy as np


    # The source omits this def line; the name loadDataSet is assumed.
    def loadDataSet():
        """
        :return: the text dataset and the corresponding labels (0 = negative, 1 = positive)
        """
        text = [['just', 'plain', 'boring'],
                ['entirely', 'predictable', 'and', 'lacks', 'energy'],
                ['no', 'surprises', 'and', 'very', 'few', 'laughs'],
                ['very', 'powerful'],
                ['the', 'most', 'fun', 'film', 'of', 'the', 'summer']]
        label = [0, 0, 0, 1, 1]
        return text, label


    def createVocabList(text):
        """
        :param text: the text dataset
        :return: the vocabulary as a list of distinct words
        """
        vocabSet = set()
        for document in text:
            vocabSet = vocabSet | set(document)
        return list(vocabSet)


    def bag_words_vec(vocab, text):
        """
        :param vocab: the vocabulary
        :param text: the text dataset
        :return: bag-of-words count vectors, one per document
        """
        data = []
        for t in text:
            vec = [0] * len(vocab)
            for word in t:
                if word in vocab:
                    vec[vocab.index(word)] += 1
            data.append(vec)
        return data


    class NB:
        def __init__(self, vocab):
            self.data = None
            self.label = None
            self.vocab = vocab
            self.vocab_len = len(self.vocab)

        def fit(self, data, label):
            self.data = data
            self.label = label

            # Prior probability of each class
            self.pc0 = label.count(0) / len(label)
            self.pc1 = label.count(1) / len(label)

            # Split the count vectors by class
            self.data0 = [data[i] for i, c in enumerate(label) if c == 0]
            self.data1 = [data[i] for i, c in enumerate(label) if c == 1]

            # Total number of word tokens per class. Summing the counts includes
            # repeated words; len() would count only distinct entries.
            self.word_num_0 = sum(sum(vec) for vec in self.data0)
            self.word_num_1 = sum(sum(vec) for vec in self.data1)

            # Print the per-class token counts and the vocabulary size
            print("Tokens in class 0: {}, tokens in class 1: {}, vocabulary size: {}".format(
                self.word_num_0, self.word_num_1, self.vocab_len))

            # Per-word counts within each class
            self.word_freq_0 = np.sum(self.data0, axis=0)
            self.word_freq_1 = np.sum(self.data1, axis=0)

        def predict(self, text):
            pred = []

            # For each document in the prediction set
            for t in text:

                # Smoothed likelihoods of the known words under class 0
                p0 = []
                for w in t:
                    # Words outside the vocabulary are skipped
                    if w in self.vocab:
                        count = float(self.word_freq_0[self.vocab.index(w)])
                        p0.append((count + 1) / (self.word_num_0 + self.vocab_len))

                # Smoothed likelihoods of the known words under class 1
                p1 = []
                for w in t:
                    # Words outside the vocabulary are skipped
                    if w in self.vocab:
                        count = float(self.word_freq_1[self.vocab.index(w)])
                        p1.append((count + 1) / (self.word_num_1 + self.vocab_len))

                print(p0)
                print(p1)

                # Unnormalized posterior of each class: prior times likelihoods
                print(self.pc0 * np.prod(p0))
                print(self.pc1 * np.prod(p1))

                if self.pc0 * np.prod(p0) > self.pc1 * np.prod(p1):
                    pred.append(0)
                else:
                    pred.append(1)

            return pred


    if __name__ == '__main__':
        text, label = loadDataSet()
        vocab = createVocabList(text)
        data = bag_words_vec(vocab, text)

        pred_text = [["predictable", "with", "no", "fun"]]

        model = NB(vocab)
        model.fit(data, label)
        result = model.predict(pred_text)
        print(result)

Running the script prints:

    Tokens in class 0: 14, tokens in class 1: 9, vocabulary size: 20
    [0.058823529411764705, 0.058823529411764705, 0.029411764705882353]
    [0.034482758620689655, 0.034482758620689655, 0.06896551724137931]
    6.106248727864848e-05
    3.280167288531715e-05
    [0]

