# Machine Learning: Applying Multinomial Naive Bayes to NLP Problems

`P(c|x) = P(x|c) * P(c) / P(x)`
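To make the formula concrete, the following minimal sketch evaluates Bayes' rule for a two-class problem. The function name `posterior` and the numeric likelihoods are illustrative assumptions, not values from this article; the evidence `P(x)` is obtained by summing `P(x|c) * P(c)` over the classes.

```python
def posterior(likelihoods, priors):
    """Return P(c|x) for every class c via Bayes' rule.

    likelihoods: dict mapping class -> P(x|c)
    priors:      dict mapping class -> P(c)
    """
    # Numerator of Bayes' rule for each class: P(x|c) * P(c)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Evidence P(x) is the sum of the numerators over all classes
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical numbers, chosen only for illustration
result = posterior(
    likelihoods={"positive": 0.02, "negative": 0.01},
    priors={"positive": 0.6, "negative": 0.4},
)
print(result)
```

Because the evidence is the same for every class, a classifier that only needs the most likely class can skip the division and compare the numerators directly.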

| Text | Reviews |
|---|---|
| “I liked the movie” | positive |
| “It’s a good movie. Nice story” | positive |
| “Nice songs. But sadly boring ending.” | negative |
| “Hero’s acting is bad but heroine looks good. Overall nice movie” | positive |
| “Sad, boring movie” | negative |
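The class priors `P(c)` used in the calculation follow directly from the label column of this table. A minimal sketch, assuming the labels are held in a plain Python list (the variable names are illustrative):

```python
from collections import Counter
from fractions import Fraction

# Labels copied from the "Reviews" column of the table above
labels = ["positive", "positive", "negative", "positive", "negative"]

# P(c) = (number of reviews with label c) / (total number of reviews)
counts = Counter(labels)
priors = {label: Fraction(n, len(labels)) for label, n in counts.items()}

for label, p in priors.items():
    print(label, p)
```

Using `Fraction` keeps the priors exact (3/5 and 2/5) rather than rounded floats.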

| Text (after preprocessing) | Reviews |
|---|---|
| “ilikedthemovi” | positive |
| “itsagoodmovienicestori” | positive |
| “nicesongsbutsadlyboringend” | negative |
| “herosactingisbadbutheroinelooksgoodoverallnicemovi” | positive |
| “sadboringmovi” | negative |

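To classify the sentence “overall liked the movie”, apply Bayes' rule:

`P(positive | overall liked the movie) = P(overall liked the movie | positive) * P(positive) / P(overall liked the movie)`

The denominator is the same for both classes, so it is enough to compare `P(overall liked the movie | positive) * P(positive)` with `P(overall liked the movie | negative) * P(negative)`.

Assuming the words are independent of one another (the “naive” assumption), the probabilities factor word by word:

`P(overall liked the movie) = P(overall) * P(liked) * P(the) * P(movie)`

`P(overall liked the movie | positive) = P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive)`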
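In general form, the factorization above and the resulting decision rule can be written as follows. The symbols are not from the original text: $w_1, \dots, w_n$ stand for the words of a sentence, $c$ for a class, and $\hat{c}$ for the predicted class.

```latex
P(w_1, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```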

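Each word probability is estimated with Laplace (add-one) smoothing: `(count of the word in the class + 1) / (total words in the class + number of unique words)`. Here the positive class has 17 words, the negative class has 7, and there are 21 unique words.

| Word | P(word \| positive) | P(word \| negative) |
|---|---|---|
| overall | (1 + 1) / (17 + 21) | (0 + 1) / (7 + 21) |
| liked | (1 + 1) / (17 + 21) | (0 + 1) / (7 + 21) |
| the | (2 + 1) / (17 + 21) | (0 + 1) / (7 + 21) |
| movie | (3 + 1) / (17 + 21) | (1 + 1) / (7 + 21) |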
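The smoothed estimates in this table can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper name `smoothed_probability` is made up for illustration, and the counts (17, 7, 21 and the per-word counts) are taken as given from the table rather than recomputed from the reviews.

```python
from fractions import Fraction


def smoothed_probability(word_count, class_word_total, vocabulary_size, alpha=1):
    """Additive (Laplace) smoothing: (count + alpha) / (total + alpha * V)."""
    return Fraction(word_count + alpha,
                    class_word_total + alpha * vocabulary_size)


# Counts as given in the table above (not recomputed from the reviews)
VOCABULARY_SIZE = 21
CLASS_TOTALS = {"positive": 17, "negative": 7}
WORD_COUNTS = {
    "overall": {"positive": 1, "negative": 0},
    "liked":   {"positive": 1, "negative": 0},
    "the":     {"positive": 2, "negative": 0},
    "movie":   {"positive": 3, "negative": 1},
}

table = {
    word: {
        label: smoothed_probability(count, CLASS_TOTALS[label], VOCABULARY_SIZE)
        for label, count in per_class.items()
    }
    for word, per_class in WORD_COUNTS.items()
}

for word, probs in table.items():
    print(word, probs["positive"], probs["negative"])
```

Adding 1 to every count ensures that a word never seen in a class (such as "overall" in the negative reviews) still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.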

`P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive) * P(positive) = 1.38 * 10^-5 = 0.0000138`

`P(overall | negative) * P(liked | negative) * P(the | negative) * P(movie | negative) * P(negative) = 0.13 * 10^-5 = 0.0000013`

The positive score is the larger of the two, so the sentence is classified as positive.
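The two scores can be checked numerically. The sketch below plugs in the smoothed probabilities from the table and the class priors 3/5 and 2/5; it sums logarithms instead of multiplying raw probabilities, a common precaution against underflow on longer texts. The variable and function names are illustrative assumptions.

```python
import math

# Smoothed word probabilities from the table above
word_probs = {
    "positive": {"overall": 2 / 38, "liked": 2 / 38, "the": 3 / 38, "movie": 4 / 38},
    "negative": {"overall": 1 / 28, "liked": 1 / 28, "the": 1 / 28, "movie": 2 / 28},
}
# Class priors: 3 of the 5 reviews are positive, 2 are negative
priors = {"positive": 3 / 5, "negative": 2 / 5}

sentence = ["overall", "liked", "the", "movie"]


def log_score(label):
    """log P(label) + sum of log P(word | label) over the sentence."""
    return math.log(priors[label]) + sum(
        math.log(word_probs[label][word]) for word in sentence
    )


log_scores = {label: log_score(label) for label in priors}
scores = {label: math.exp(value) for label, value in log_scores.items()}
prediction = max(log_scores, key=log_scores.get)

print(scores)
print(prediction)
```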

Import the required packages (NLTK is used here for preprocessing) and build the dataset:

    import re

    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    dataset = [["I liked the movie", "positive"],
               ["It’s a good movie. Nice story", "positive"],
               ["Hero’s acting is bad but heroine looks good. "
                "Overall nice movie", "positive"],
               ["Nice songs. But sadly boring ending.", "negative"],
               ["sad movie, boring movie", "negative"]]

    dataset = pd.DataFrame(dataset)
    dataset.columns = ["Text", "Reviews"]

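Download the NLTK stop-word list, then clean, lowercase, and stem each review:

    nltk.download('stopwords')

    corpus = []
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    for i in range(len(dataset)):
        # Replace every non-letter with a space so word boundaries survive
        text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i])
        text = text.lower()
        text = text.split()
        # Stem each word and drop English stop words
        text = [ps.stem(word) for word in text if word not in stop_words]
        text = ' '.join(text)
        corpus.append(text)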

Create the bag-of-words model:

    cv = CountVectorizer(max_features=1500)
    X = cv.fit_transform(corpus).toarray()
    y = dataset.iloc[:, 1].values
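To see what a bag-of-words matrix looks like, the sketch below rebuilds one by hand for a tiny corpus. It only approximates `CountVectorizer` (plain whitespace tokenization, alphabetically sorted vocabulary); the helper name `bag_of_words` and the two sample strings are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical, already-cleaned documents
vocabulary, matrix = bag_of_words(["nice movie", "sad movie bore movie"])
print(vocabulary)  # ['bore', 'movie', 'nice', 'sad']
for row in matrix:
    print(row)     # [0, 1, 1, 0] then [1, 2, 0, 1]
```

Each row is one document and each column one vocabulary word, so word order is discarded and only the counts remain.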

Split the data into training and test sets:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

Train a Gaussian naive Bayes classifier on the training data:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import confusion_matrix

    classifier = GaussianNB()
    classifier.fit(X_train, y_train)

    # Predict the test set results
    y_pred = classifier.predict(X_test)

    # Build the confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    cm
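The script above fits `GaussianNB`, whereas the hand calculation earlier follows the multinomial model with add-one smoothing. As a hedged sketch of the multinomial variant, the code below swaps in scikit-learn's `MultinomialNB` (`alpha=1.0` corresponds to add-one smoothing) and trains on all five raw reviews, without stemming or a train/test split. Those simplifications and the variable names are assumptions made for illustration. Because `CountVectorizer` tokenizes differently from the hand count, the fitted probabilities will not match the table exactly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "I liked the movie",
    "It's a good movie. Nice story",
    "Nice songs. But sadly boring ending.",
    "Hero's acting is bad but heroine looks good. Overall nice movie",
    "Sad, boring movie",
]
labels = ["positive", "positive", "negative", "positive", "negative"]

# Word counts are the natural input for the multinomial model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# alpha=1.0 applies Laplace (add-one) smoothing
model = MultinomialNB(alpha=1.0)
model.fit(X, labels)

# Classify the sentence used in the hand calculation
query = vectorizer.transform(["overall liked the movie"])
prediction = model.predict(query)[0]
print(prediction)
```

With only five training reviews this is a demonstration of the API rather than a meaningful model; a real application would need far more data and a held-out test set.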

