Classifying documents with Naïve Bayes

到不了的都叫做远方

Modified on 2020-05-06 11:44:53

Naïve Bayes is a really interesting model. It resembles k-NN in that it makes assumptions that may oversimplify reality, yet it still performs well in many cases.

Getting ready

In this recipe, we'll use Naïve Bayes to do document classification with sklearn. An example I have personal experience with is taking the words that make up an account descriptor in accounting, such as Accounts Payable, and determining whether it belongs to the Income Statement, Cash Flow Statement, or Balance Sheet.

The basic idea is to use the word frequencies from a labeled training corpus to learn how documents map to classes. Then, we can apply the fitted model to a test set and attempt to predict its labels.

We'll use the 20 newsgroups dataset that ships with sklearn (stored below in the variable newgroups) to play with the Naïve Bayes model. It's a nontrivial amount of data, so we'll fetch it instead of loading it. We'll also limit the categories to rec.autos and rec.motorcycles:

from sklearn.datasets import fetch_20newsgroups
categories = ["rec.autos", "rec.motorcycles"]
newgroups = fetch_20newsgroups(categories=categories)

# take a look

print('\n'.join(newgroups.data[:1]))
From: gregl@zimmer.CSUFresno.EDU (Greg Lewis)
Subject: Re: WARNING.....(please read)...
Keywords: BRICK, TRUCK, DANGER
Nntp-Posting-Host: zimmer.csufresno.edu
Organization: CSU Fresno
Lines: 33
[...]

newgroups.target_names[newgroups.target[:1][0]]
'rec.autos'

Now that we have newgroups, we'll need to represent each document as a bag of words. This representation is what gives Naïve Bayes its name: the model is "naive" because documents are classified without regard for any intra-document word covariance. This might be considered a flaw, but Naïve Bayes has been shown to work reasonably well.

We need to preprocess the data into a bag-of-words matrix. This is a sparse matrix that has an entry wherever a word is present in a document. This matrix can become quite large, as illustrated:

from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer()
bow = count_vec.fit_transform(newgroups.data)

This is a sparse matrix whose shape is the number of documents by the number of distinct words. Each entry is the frequency of a particular term in a particular document:

bow
<1192x19177 sparse matrix of type '<type 'numpy.int64'>'
with 164296 stored elements in Compressed Sparse Row format>
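
To make the layout of this matrix concrete, here is a minimal hand-rolled sketch of a bag-of-words count matrix. The three sentences are a made-up toy corpus, and the code only illustrates the idea; it is not how CountVectorizer is implemented. For instance, CountVectorizer by default also lowercases the text and ignores single-character tokens, which this sketch does not do.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the newsgroups data).
docs = [
    "the car needs a new engine",
    "the bike needs new tires",
    "the car and the bike",
]

# Vocabulary: every distinct token, sorted so each word gets a fixed column.
vocabulary = sorted({word for doc in docs for word in doc.split()})
column = {word: j for j, word in enumerate(vocabulary)}

# One row per document, one column per vocabulary word; entries are counts.
matrix = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        matrix[i][column[word]] = count

print(vocabulary)
for row in matrix:
    print(row)
```

Most entries are zero even in this tiny case, which is why a sparse format pays off once the vocabulary grows to thousands of words.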

We'll actually need the matrix as a dense array for the Naïve Bayes object we use below (GaussianNB). So, let's convert it:

import numpy as np
# toarray() returns a dense NumPy array with the same counts
bow = bow.toarray()

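Clearly, most of the entries are 0, but as a sanity check we can recover which words were counted in the first document:

# On scikit-learn 1.0 and later, use count_vec.get_feature_names_out() instead;
# get_feature_names() was removed in 1.2.
words = np.array(count_vec.get_feature_names())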
words[bow[0] > 0][:5]
array(['10pm', '1qh336innfl5', '33', '93740',
       '___________________________________________________________________'],
      dtype='<U79')

Now, do these words actually appear in the first document? Let's check using the following commands:

'10pm' in newgroups.data[0].lower()
True
'1qh336innfl5' in newgroups.data[0].lower()
True

How to do it...

OK, so it took a bit longer than normal to get the data ready, but text data isn't as quickly represented as a matrix as the data we're used to. Now that we're ready, we'll fire up the classifier and fit our model:

from sklearn import naive_bayes
clf = naive_bayes.GaussianNB()

Before we fit the model, let's split the dataset into a training set and a test set:

# Option 1: split with a random boolean mask
mask = np.random.choice([True, False], len(bow))
clf.fit(bow[mask], newgroups.target[mask])
predictions = clf.predict(bow[~mask])

# Option 2: split with train_test_split (this refit is the one scored below)
from sklearn.model_selection import train_test_split
train_bow, test_bow, train_target, test_target = train_test_split(bow, newgroups.target)
clf.fit(train_bow, train_target)
predictions = clf.predict(test_bow)

Now that we have fit a model on the training set and predicted the test set in an attempt to determine which categories go with which articles, let's get a sense of the approximate accuracy:

np.mean(predictions == test_target)
0.959731543624161
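
A single accuracy number hides which class gets confused with which. As a small illustration, the sketch below computes accuracy and a per-pair confusion count in plain Python. The label lists are invented for the example (with 0 standing for rec.autos and 1 for rec.motorcycles); they are not the predictions from the model above.

```python
from collections import Counter

# Hypothetical true labels and predictions (0 = rec.autos, 1 = rec.motorcycles).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Accuracy: the fraction of positions where prediction and truth agree.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Count (true label, predicted label) pairs to see which class is confused.
confusion = Counter(zip(y_true, y_pred))

print(accuracy)
for (t, p), n in sorted(confusion.items()):
    print(f"true={t} predicted={p}: {n}")
```

The first line of the computation is the same quantity as np.mean(predictions == test_target), just written without NumPy.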

How it works...

The fundamental idea of Naïve Bayes is that we can estimate the probability that a data point belongs to a class, given its feature vector. Using Bayes' formula, this posterior can be rewritten in terms of the class prior and the probability of the feature vector given the class. The MAP (maximum a posteriori) estimate then chooses the class for which this posterior probability is largest.
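
The MAP rule can be sketched in a few lines of plain Python. This is an illustrative sketch, not the sklearn implementation: it assumes a multinomial likelihood over word counts (rather than the Gaussian likelihood used by GaussianNB above), Laplace smoothing with a hypothetical alpha of 1, and a made-up four-document training set. The names log_posterior and predict are introduced here only for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training documents.
train = [
    ("engine oil car", "autos"),
    ("car brakes engine", "autos"),
    ("helmet bike ride", "motorcycles"),
    ("bike engine helmet", "motorcycles"),
]

vocab = {w for text, _ in train for w in text.split()}
class_docs = Counter(label for _, label in train)   # documents per class
word_counts = defaultdict(Counter)                  # word counts per class
for text, label in train:
    word_counts[label].update(text.split())

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text): log prior plus summed log likelihoods."""
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for w in text.split():
        if w not in vocab:
            continue  # ignore words never seen in training
        # Laplace-smoothed estimate of P(word | label).
        score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(text):
    # MAP estimate: the class with the largest posterior score.
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("car engine"))
print(predict("helmet ride"))
```

Summing log probabilities per word is where the "naive" independence assumption enters: each word contributes to the score on its own, regardless of the other words in the document.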

There's more...

We can also extend Naïve Bayes to do multiclass work. Instead of assuming a Gaussian likelihood, we'll use a multinomial likelihood, which is a natural fit for word counts.

First, let's get a third category of data:

from sklearn.datasets import fetch_20newsgroups
mn_categories = ["rec.autos", "rec.motorcycles","talk.politics.guns"]
mn_newgroups = fetch_20newsgroups(categories=mn_categories)

We'll need to vectorize this just like in the two-class case:

mn_bow = count_vec.fit_transform(mn_newgroups.data)
mn_bow = np.array(mn_bow.todense())

Let's create a mask array to split the data into training and test sets:

mn_mask = np.random.choice([True, False], len(mn_newgroups.data))
multinom = naive_bayes.MultinomialNB()
multinom.fit(mn_bow[mn_mask], mn_newgroups.target[mn_mask])
mn_predict = multinom.predict(mn_bow[~mn_mask])
np.mean(mn_predict == mn_newgroups.target[~mn_mask])
0.945067264573991

It's not completely surprising that we did well. We did fairly well in the two-class case, and since one would guess that the talk.politics.guns category is fairly orthogonal to the other two, we should expect to do pretty well here too.
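
One detail worth noting: unlike GaussianNB, MultinomialNB accepts the sparse matrix that CountVectorizer produces, so the dense conversion can be skipped for the multinomial model. The sketch below shows this on a made-up six-document corpus with three classes; the texts and labels are hypothetical and only stand in for the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus with three classes.
texts = [
    "car engine oil change", "new car tires and brakes",
    "motorcycle helmet and gloves", "ride the motorcycle with a helmet",
    "rifle and pistol laws", "pistol permit and rifle range",
]
labels = ["autos", "autos", "motorcycles", "motorcycles", "guns", "guns"]

vec = CountVectorizer()
X = vec.fit_transform(texts)   # stays a SciPy sparse matrix

clf = MultinomialNB()
clf.fit(X, labels)             # fitted directly on the sparse counts

# New text must go through transform (not fit_transform) so that it is
# encoded with the vocabulary learned from the training texts.
new_X = vec.transform(["helmet for my motorcycle", "car brakes"])
print(clf.predict(new_X))
```

Keeping the matrix sparse matters more as the vocabulary grows, since a dense array stores every zero explicitly.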

This article is a translation of a foreign-language original.
