Machine Learning: Applying Multinomial Naive Bayes to NLP Problems

XXXX-user

Published 2019-09-25 11:16:32

Part of the column: 不仅仅是python

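Naive Bayes classifiers are a family of probabilistic algorithms based on Bayes' theorem together with the "naive" assumption that every pair of features is conditionally independent given the class. Bayes' theorem computes the probability P(c | x), where c is the class of a possible outcome and x is the given instance to be classified, described by certain features.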

P(c|x) = P(x|c) * P(c) / P(x)

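Naive Bayes is widely used for natural language processing (NLP) problems, where it predicts the label of a text: it computes the probability of each label for the given text and outputs the label with the highest probability.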

How does the naive Bayes algorithm work?

Consider an example: classifying reviews as positive or negative.

TEXT	REVIEWS
“I liked the movie”	positive
“It’s a good movie. Nice story”	positive
“Nice songs. But sadly boring ending. ”	negative
“Hero’s acting is bad but heroine looks good. Overall nice movie”	positive
“Sad, boring movie”	negative

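We want to decide whether the text “overall liked the movie” is a positive or a negative review. To do so we must compute two quantities:

P(positive | overall liked the movie): the probability that the label is positive, given the sentence “overall liked the movie”.

P(negative | overall liked the movie): the probability that the label is negative, given the sentence “overall liked the movie”.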

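Before that, we first apply two preprocessing steps to the text: removing stop words and stemming.

Removing stop words: these are common words that add little real information, such as “able”, “even”, “other”, and so on.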

Stemming: reducing each word to its root form, for example “movie” becomes “movi”.
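As a rough illustration of the stop-word step, the sketch below lowercases a review, keeps only alphabetic tokens, and drops the words found in a small stop-word set. The `STOP_WORDS` set and the `preprocess` helper are hypothetical names made up for this example; a real pipeline would normally take its list from a library such as NLTK. Stemming is left out here because it needs a proper stemmer.

```python
import re

# Hypothetical, hand-picked stop-word list for illustration only;
# real lists (for example NLTK's English list) are much longer.
STOP_WORDS = {"i", "a", "the", "is", "it", "s", "but"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("It's a good movie. Nice story"))
# prints ['good', 'movie', 'nice', 'story']
```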

After applying these two techniques, our text becomes:

TEXT	REVIEWS
“like movi”	positive
“good movi nice stori”	positive
“nice song sadli bore end”	negative
“hero act bad heroin look good overal nice movi”	positive
“sad bore movi”	negative

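Feature engineering: the important part is to extract features from the data that a machine learning algorithm can work with. Here the data is text, which must be converted into numbers we can compute with. We use word frequencies: each document is treated as the collection of words it contains (a bag of words), and our features are the counts of each word.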
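The word-count idea can be sketched with the standard library alone. The `bag_of_words` helper and the two sample documents are assumptions made for this illustration, not part of the original article.

```python
from collections import Counter

def bag_of_words(text):
    """Map a whitespace-separated document to its word counts."""
    return Counter(text.lower().split())

docs = ["sad movie boring movie", "nice movie"]

# The vocabulary fixes the order of the feature columns.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, one column per vocabulary word.
vectors = [[bag_of_words(doc)[word] for word in vocabulary] for doc in docs]

print(vocabulary)  # ['boring', 'movie', 'nice', 'sad']
print(vectors)     # [[1, 2, 0, 1], [0, 1, 1, 0]]
```

Each row is the feature vector of one document; a word that does not occur in a document simply gets a count of 0.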

In our case, Bayes' theorem gives P(positive | overall liked the movie) as:

P(positive | overall liked the movie) = P(overall liked the movie | positive) * P(positive) / P(overall liked the movie)

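Since our classifier only needs to find which label has the larger probability, we can drop the divisor, which is the same for both labels, and compare

P(overall liked the movie | positive) * P(positive) with P(overall liked the movie | negative) * P(negative)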

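There is a problem, though: “overall liked the movie” does not appear in our training data, so the probability estimated for the whole sentence would be zero. This is where the “naive” assumption comes in: every word in the sentence is treated as independent of the others, so we can look at individual words instead of the whole sentence.

We can then write: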

P(overall liked the movie) = P(overall) * P(liked) * P(the) * P(movie)

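The next step is to apply this within Bayes' theorem:

P(overall liked the movie | positive) = P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive)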

These individual words do appear in our training data, so we can compute these probabilities!
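Written generally, for a text made of the words w_1, ..., w_n, the comparison between the two labels amounts to the decision rule below. The symbols \hat{c}, w_i, and n are notation introduced here for illustration; they do not appear in the original article.

```latex
\hat{c} = \arg\max_{c \in \{\text{positive},\, \text{negative}\}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The divisor P(x) is omitted because it is the same for every label and does not change which label wins.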

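Calculating the probabilities:

First, we compute the prior probability of each label: for a given sentence in our training data, the probability that it is positive, P(positive), is 3/5. Then P(negative) is 2/5.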
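The priors are just label frequencies in the training set. A minimal sketch, assuming the five labels from the review table are held in a plain list (the variable names are illustrative):

```python
from collections import Counter
from fractions import Fraction

# Labels of the five training reviews, in table order.
labels = ["positive", "positive", "negative", "positive", "negative"]

label_counts = Counter(labels)
priors = {label: Fraction(count, len(labels))
          for label, count in label_counts.items()}

print(priors["positive"], priors["negative"])  # 3/5 2/5
```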

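Then, computing P(overall | positive) means counting how many times the word “overall” occurs in the positive texts (1) and dividing by the total number of words in the positive texts (17). Therefore P(overall | positive) = 1/17, P(liked | positive) = 1/17, P(the | positive) = 2/17, and P(movie | positive) = 3/17.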

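If a probability comes out as zero, we use Laplace smoothing: we add 1 to every count so that it is never zero. To balance this, we add the number of possible words to the divisor, so the result never exceeds 1. In our case, the total number of possible words is 21.

Applying smoothing, the results are: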
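Add-one smoothing is a one-line formula. The sketch below is an illustration with a hypothetical helper name, `smoothed_likelihood`, and it plugs in the word totals quoted in the article (17 positive words, 7 negative words, 21 possible words).

```python
from fractions import Fraction

def smoothed_likelihood(word_count, class_total, vocab_size):
    """Add-one (Laplace) estimate of P(word | class)."""
    return Fraction(word_count + 1, class_total + vocab_size)

# "overall" occurs once in the positive reviews and never in the negative ones.
print(smoothed_likelihood(1, 17, 21))  # 1/19, i.e. (1 + 1)/(17 + 21)
print(smoothed_likelihood(0, 7, 21))   # 1/28, i.e. (0 + 1)/(7 + 21)
```

A word that was never seen with a class still gets a small non-zero probability, so a single unseen word no longer forces the whole product to zero.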

WORD	P(WORD | POSITIVE)	P(WORD | NEGATIVE)
overall	(1 + 1)/(17 + 21)	(0 + 1)/(7 + 21)
liked	(1 + 1)/(17 + 21)	(0 + 1)/(7 + 21)
the	(2 + 1)/(17 + 21)	(0 + 1)/(7 + 21)
movie	(3 + 1)/(17 + 21)	(1 + 1)/(7 + 21)

Now we multiply all the probabilities and see which is larger:

P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive) * P(positive) = 1.38 * 10^{-5} = 0.0000138

P(overall | negative) * P(liked | negative) * P(the | negative) * P(movie | negative) * P(negative) = 0.13 * 10^{-5} = 0.0000013
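These two products can be checked with a few lines of Python. The sketch below hard-codes the counts and totals quoted in the article; the constant and function names are illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math

VOCAB_SIZE = 21
CLASS_TOTALS = {"positive": 17, "negative": 7}
PRIORS = {"positive": 3 / 5, "negative": 2 / 5}
# Counts of each query word per class, as stated in the article.
WORD_COUNTS = {
    "positive": {"overall": 1, "liked": 1, "the": 2, "movie": 3},
    "negative": {"overall": 0, "liked": 0, "the": 0, "movie": 1},
}

def score(label, words):
    """Prior times the product of add-one smoothed word likelihoods."""
    total = CLASS_TOTALS[label] + VOCAB_SIZE
    likelihoods = [(WORD_COUNTS[label][word] + 1) / total for word in words]
    return PRIORS[label] * math.prod(likelihoods)

query = ["overall", "liked", "the", "movie"]
scores = {label: score(label, query) for label in PRIORS}

print(scores)                       # roughly 1.38e-05 and 1.30e-06
print(max(scores, key=scores.get))  # positive
```

For longer texts the product of many small numbers can underflow, so implementations usually add logarithms of the probabilities instead of multiplying the probabilities themselves.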

Our classifier assigns the positive label to “overall liked the movie”.

The implementation follows:

# Import the packages; NLTK is used for stop words and stemming
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Toy training set: [text, label]
dataset = [["I liked the movie", "positive"],
           ["It’s a good movie. Nice story", "positive"],
           ["Hero’s acting is bad but heroine looks good. "
            "Overall nice movie", "positive"],
           ["Nice songs. But sadly boring ending.", "negative"],
           ["sad movie, boring movie", "negative"]]

dataset = pd.DataFrame(dataset)
dataset.columns = ["Text", "Reviews"]

# Download the NLTK stop-word list (only needed once)
nltk.download('stopwords')

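corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

for i in range(len(dataset)):
    # Replace non-letters with a space so that words stay separated
    text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i])
    text = text.lower()
    text = text.split()
    # Drop stop words and stem the remaining words
    text = [ps.stem(word) for word in text if word not in stop_words]
    text = ' '.join(text)
    corpus.append(text)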

# Build the bag-of-words model
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

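# Split the data into training and test sets
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)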

# Train a Gaussian naive Bayes classifier on the training data
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predict the test-set results
y_pred = classifier.predict(X_test)

# Build the confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm
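The scikit-learn code above fits `GaussianNB`, whereas the hand calculation earlier follows the multinomial model with add-one smoothing. For comparison, here is a dependency-free sketch of that multinomial model, working in log space. The `TinyMultinomialNB` class and `tokenize` helper are hypothetical names written for this illustration, and the sketch skips stop-word removal and stemming, so its word counts differ from the figures quoted in the worked example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and keep alphabetic tokens (no stop-word removal or stemming)."""
    return re.findall(r"[a-z]+", text.lower())

class TinyMultinomialNB:
    """Hypothetical minimal multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_score(self, text, label):
        # log P(label) + sum over words of log P(word | label), smoothed
        counts = self.word_counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        score = math.log(self.label_counts[label] / sum(self.label_counts.values()))
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / denom)
        return score

    def predict(self, text):
        return max(self.label_counts, key=lambda label: self.log_score(text, label))

texts = ["I liked the movie",
         "It's a good movie. Nice story",
         "Nice songs. But sadly boring ending.",
         "Hero's acting is bad but heroine looks good. Overall nice movie",
         "Sad, boring movie"]
labels = ["positive", "positive", "negative", "positive", "negative"]

model = TinyMultinomialNB().fit(texts, labels)
print(model.predict("overall liked the movie"))  # positive
```

Summing logarithms instead of multiplying raw probabilities gives the same ranking of labels while avoiding numerical underflow on longer texts.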

Originally published 2019-09-23 on the WeChat official account yale记.
