Text Classification with Naive Bayes

spark

Published 2018-12-20 11:47:55

This article is part of the column: Data Science

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load and inspect the data

In [2]:

df = pd.read_csv('/Users/spark/Downloads/Restaurant_Reviews.tsv',sep='\t')

In [3]:

df.head()

Out[3]:

	Review	Liked
0	Wow... Loved this place.	1
1	Crust is not good.	0
2	Not tasty and the texture was just nasty.	0
3	Stopped by during the late May bank holiday of...	1
4	The selection on the menu was great and so wer...	1

In [4]:

df.describe()

Out[4]:

	Liked
count	1000.00000
mean	0.50000
std	0.50025
min	0.00000
25%	0.00000
50%	0.50000
75%	1.00000
max	1.00000

In [5]:

df.dtypes

Out[5]:

Review    object
Liked      int64
dtype: object

Preliminary statistical analysis

In [6]:

# df['text_length'] = df.Review.map(len)
# Word count per review (split on single spaces)
df['word_length'] = df.Review.map(lambda x:len(x.split(' ')))

In [7]:

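# Restrict to the numeric columns; recent pandas versions raise an error on the text column
df[['Liked', 'word_length']].corr()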

Out[7]:

	Liked	word_length
Liked	1.000000	-0.096573
word_length	-0.096573	1.000000

In [8]:

g = sns.FacetGrid(data=df, col='Liked')
g.map(plt.hist, 'word_length', bins=50)

Out[8]:

<seaborn.axisgrid.FacetGrid at 0x10e6e0d30>

In [9]:

sns.boxplot(x='Liked', y='word_length', data=df)

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x1108c8278>

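The plots and the correlation coefficient (about -0.097) show that whether a review is liked has almost no linear relationship with its word count.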
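
For reference, `df.corr()` reports the Pearson correlation coefficient by default. The sketch below computes that coefficient by hand. The `liked` and `word_counts` values are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: label (1 = liked) and word count per review.
liked = [1, 0, 0, 1, 1, 0]
word_counts = [4, 4, 8, 15, 12, 11]

r = pearson(liked, word_counts)
print(round(r, 3))
```

A value close to 0 means the label tells you little about the word count; values near +1 or -1 would indicate a strong linear relationship.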

Machine learning

Text encoding

In [10]:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package stopwords to /Users/spark/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [11]:

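import string

# Build the stop word set once instead of reloading the list for every word
STOP_WORDS = set(stopwords.words('english'))

def text_process(text):
    '''
    Process a string as follows:
    1. Remove punctuation
    2. Remove stop words
    3. Return the remaining words as a list
    '''
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]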
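
To see what this kind of preprocessing produces, here is a small standalone sketch. It assumes a tiny hand-picked stop word set (`DEMO_STOP_WORDS`, a hypothetical name) so that it runs without downloading the NLTK corpus; the notebook itself uses `stopwords.words('english')`.

```python
import string

# Assumption: a tiny hand-picked stop word set so the example runs without
# downloading the NLTK corpus; the article uses stopwords.words('english').
DEMO_STOP_WORDS = {'the', 'was', 'and', 'this', 'is', 'a'}

def demo_text_process(text):
    """Strip punctuation, then drop stop words (case-insensitive)."""
    nopunc = ''.join(ch for ch in text if ch not in string.punctuation)
    return [w for w in nopunc.split() if w.lower() not in DEMO_STOP_WORDS]

tokens = demo_text_process('Wow... Loved this place.')
print(tokens)  # ['Wow', 'Loved', 'place']
```

Note that the surviving tokens keep their original case, so 'Loved' and 'loved' would end up as two different features.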

In [12]:

X = df.Review
y = df.Liked
bow_transformer = CountVectorizer(analyzer=text_process).fit(X)
X = bow_transformer.transform(X)
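
Conceptually, the bag-of-words step builds a sorted vocabulary and then counts how often each vocabulary word occurs in each review. The following is a dense, pure-Python sketch of that idea on made-up token lists; `CountVectorizer` itself returns a sparse matrix, and the names `docs`, `vocab` and `to_counts` are illustrative.

```python
from collections import Counter

# Hypothetical, already-tokenized reviews (as an analyzer would return them).
docs = [['Loved', 'place'], ['Crust', 'good'], ['place', 'good', 'good']]

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def to_counts(tokens):
    """Return one row of the document-term matrix for a token list."""
    row = [0] * len(vocab)
    for tok, n in Counter(tokens).items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            row[vocab[tok]] += n
    return row

matrix = [to_counts(d) for d in docs]
print(vocab)   # {'Crust': 0, 'Loved': 1, 'good': 2, 'place': 3}
print(matrix)  # [[0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 2, 1]]
```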

Training

In [13]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

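naive_bayes: scikit-learn's Naive Bayes module
MultinomialNB: assumes that, within each class, the features (here, word counts) follow a multinomial distribution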
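
As a rough illustration of what a multinomial Naive Bayes classifier computes, the sketch below scores each class as log P(c) + sum over words of x_w * log((N_cw + alpha) / (N_c + alpha * |V|)), where N_cw is the count of word w in class c, N_c is the total word count in class c, and |V| is the vocabulary size. This is a simplified teaching version on made-up counts, not scikit-learn's implementation; the function names are hypothetical.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count rows."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_lik = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [sum(col) for col in zip(*rows)]  # per-word counts in class c
        denom = sum(totals) + alpha * n_features
        log_lik[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_lik

def predict_multinomial_nb(model, x):
    """Pick the class with the highest log posterior (up to a constant)."""
    classes, log_prior, log_lik = model
    scores = {c: log_prior[c] + sum(n * ll for n, ll in zip(x, log_lik[c]))
              for c in classes}
    return max(scores, key=scores.get)

# Hypothetical toy counts; columns are ['bad', 'good', 'great'].
X_toy = [[2, 0, 0], [1, 0, 0], [0, 1, 1], [0, 2, 0]]
y_toy = [0, 0, 1, 1]
model = fit_multinomial_nb(X_toy, y_toy)
print(predict_multinomial_nb(model, [1, 0, 0]))  # 0
print(predict_multinomial_nb(model, [0, 1, 1]))  # 1
```

The `alpha=1.0` default corresponds to Laplace (add-one) smoothing, which keeps a word that never appeared in a class from forcing that class's probability to zero.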

In [14]:

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train, y_train)

Out[14]:

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
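
In the cells above, the vocabulary is learned from all 1000 reviews before the train/test split, so words that occur only in test reviews are already part of the feature space. One common alternative, sketched here on a made-up corpus, wraps the vectorizer and the classifier in a `Pipeline` so that `fit` only ever sees training data. The toy texts and labels are assumptions for illustration, and the default `CountVectorizer` tokenizer is used instead of `text_process`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical miniature corpus, for illustration only.
texts = ['loved it', 'great food', 'amazing place', 'really great',
         'bad crust', 'nasty texture', 'not good', 'bad service']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

texts_train, texts_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=101, stratify=labels)

pipe = Pipeline([
    ('bow', CountVectorizer()),   # vocabulary learned from training texts only
    ('nb', MultinomialNB()),
])
pipe.fit(texts_train, y_tr)
print(pipe.predict(texts_test))
```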

In [15]:

preds = nb.predict(X_test)

Prediction

In [16]:

my_test_review = 'room is  bad'
my_test_review_transformed = bow_transformer.transform([my_test_review])
nb.predict(my_test_review_transformed)[0]

Out[16]:

In [17]:

my_test_review = 'room is expensive'
my_test_review_transformed = bow_transformer.transform([my_test_review])
nb.predict(my_test_review_transformed)[0]

Out[17]:

In [18]:

my_test_review = 'suprise me'
my_test_review_transformed = bow_transformer.transform([my_test_review])
nb.predict(my_test_review_transformed)[0]

Out[18]:

In [19]:

my_test_review = 'amazing'
my_test_review_transformed = bow_transformer.transform([my_test_review])
nb.predict(my_test_review_transformed)[0]

Out[19]:
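
One thing to keep in mind with ad hoc inputs like these: `transform` silently ignores tokens that were not seen during `fit`. If no token of an input is in the vocabulary (which may be the case for a misspelling such as 'suprise', though that depends on the training data), the feature vector is all zeros and `MultinomialNB` falls back to the class priors. A minimal sketch on a made-up corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data with a 3:1 class imbalance.
texts = ['great food', 'loved it', 'amazing place', 'bad crust']
labels = [1, 1, 1, 0]

vec = CountVectorizer().fit(texts)
clf = MultinomialNB().fit(vec.transform(texts), labels)

unseen = vec.transform(['suprise'])      # no token is in the vocabulary
print(unseen.nnz)                        # 0 non-zero entries
print(clf.predict_proba(unseen)[0])      # equals the class priors
print(np.exp(clf.class_log_prior_))      # priors estimated from the labels
```

Checking `predict_proba` alongside `predict` is a simple way to tell a confident prediction from one that only reflects the class balance.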

Model evaluation

Accuracy is about 73% (219 of the 300 test reviews are classified correctly)

In [20]:

from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, preds))
print('\n')
print(classification_report(y_test, preds))

[[ 96  54]
 [ 27 123]]


              precision    recall  f1-score   support

           0       0.78      0.64      0.70       150
           1       0.69      0.82      0.75       150

   micro avg       0.73      0.73      0.73       300
   macro avg       0.74      0.73      0.73       300
weighted avg       0.74      0.73      0.73       300
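
The figures in the report can be recomputed from the confusion matrix alone. The sketch below does this in plain Python, using the matrix printed above with rows as true classes and columns as predicted classes; the helper name `prf` is illustrative.

```python
# Confusion matrix from the output above: rows = true class, cols = predicted.
cm = [[96, 54],
      [27, 123]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def prf(cm, k):
    """Precision, recall and F1 for class k of a 2x2 confusion matrix."""
    tp = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]   # column sum: everything predicted as k
    actual_k = sum(cm[k])               # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(round(accuracy, 2))                    # 0.73
print([round(v, 2) for v in prf(cm, 0)])     # [0.78, 0.64, 0.7]
print([round(v, 2) for v in prf(cm, 1)])     # [0.69, 0.82, 0.75]
```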

This article was shared from the author's personal site/blog.

Originally published: 2018-12-08
