# Exclusive | A Beginner's Guide to Exploratory Data Analysis (EDA) on Text Data (Amazon Case Study)

Exploratory Data Analysis is the process of exploring data, generating insights, testing hypotheses, checking assumptions, and revealing underlying hidden patterns in the data.


Introduction to NLP (free course):

https://courses.analyticsvidhya.com/courses/Intro-to-NLP

• Understanding the problem setup
• Basic text data preprocessing
• Cleaning text data with Python
• Preparing text data for exploratory data analysis (EDA)
• Exploratory data analysis of Amazon product reviews with Python

https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products

```
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
# Regular expressions
import re
# String handling
import string
# Mathematical operations
import math
```

Importing the data and keeping only the columns we need:

```
# Reading the data (the file name here is illustrative -- point it at the
# CSV downloaded from the Kaggle dataset linked above)
df = pd.read_csv('amazon_product_reviews.csv')
df = df[['name', 'reviews.text', 'reviews.doRecommend', 'reviews.numHelpful']]
print("Shape of data=>", df.shape)
```

`df.isnull().sum() `

```
df.dropna(inplace=True)
df.isnull().sum()
```

https://www.analyticsvidhya.com/blog/2020/03/what-are-lambda-functions-in-python/

```
df = df.groupby('name').filter(lambda x: len(x) > 500).reset_index(drop=True)
print('Number of products=>', len(df['name'].unique()))
```
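The `groupby(...).filter(...)` step above can be sketched on a toy frame (the threshold is lowered to >2 for illustration, and the product names are made up):

```python
import pandas as pd

# Toy frame: product 'A' has 3 reviews, 'B' only 1
df_toy = pd.DataFrame({'name': ['A', 'A', 'A', 'B'],
                       'review': ['good', 'bad', 'ok', 'fine']})

# Keep only products with more than 2 reviews (the article uses > 500)
filtered = df_toy.groupby('name').filter(lambda x: len(x) > 2).reset_index(drop=True)
print(list(filtered['name'].unique()))  # → ['A']
```

`filter` drops every group that fails the predicate, so all of B's rows disappear at once.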

```
df['reviews.doRecommend'] = df['reviews.doRecommend'].astype(int)
```

https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/

`df['name'].unique()`

`df['name']=df['name'].apply(lambda x: x.split(',,,')[0])`
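The effect of the split can be seen on a made-up listing name of the kind this dataset contains, where variants of the same product are chained with `,,,`:

```python
# Hypothetical duplicated listing name (illustrative, not from the dataset)
name = "Amazon Fire HD 8 Tablet,,,Amazon Fire HD 8 Tablet with Alexa"
print(name.split(',,,')[0])  # → Amazon Fire HD 8 Tablet
```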

```
for index, text in enumerate(df['reviews.text'][35:40]):
    print('Review %d:\n' % (index + 1), text)
```

• Expanding contractions;
• Lowercasing the review text;
• Removing digits and words containing digits;
• Removing punctuation.

```
# Dictionary of English Contractions
contractions_dict = { "ain't": "are not","'s":" is","aren't": "are not",
"can't": "cannot","can't've": "cannot have",
"'cause": "because","could've": "could have","couldn't": "could not",
"couldn't've": "could not have", "didn't": "did not","doesn't": "does not",
"hasn't": "has not","haven't": "have not","he'd": "he would",
"he'd've": "he would have","he'll": "he will", "he'll've": "he will have",
"how'd": "how did","how'd'y": "how do you","how'll": "how will",
"I'd": "I would", "I'd've": "I would have","I'll": "I will",
"I'll've": "I will have","I'm": "I am","I've": "I have", "isn't": "is not",
"it'd": "it would","it'd've": "it would have","it'll": "it will",
"it'll've": "it will have", "let's": "let us","ma'am": "madam",
"mayn't": "may not","might've": "might have","mightn't": "might not",
"mightn't've": "might not have","must've": "must have","mustn't": "must not",
"mustn't've": "must not have", "needn't": "need not",
"needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
"oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
"shan't've": "shall not have","she'd": "she would","she'd've": "she would have",
"she'll": "she will", "she'll've": "she will have","should've": "should have",
"shouldn't": "should not", "shouldn't've": "should not have","so've": "so have",
"that'd": "that would","that'd've": "that would have", "there'd": "there would",
"there'd've": "there would have", "they'd": "they would",
"they'd've": "they would have","they'll": "they will",
"they'll've": "they will have", "they're": "they are","they've": "they have",
"to've": "to have","wasn't": "was not","we'd": "we would",
"we'd've": "we would have","we'll": "we will","we'll've": "we will have",
"we're": "we are","we've": "we have", "weren't": "were not","what'll": "what will",
"what'll've": "what will have","what're": "what are", "what've": "what have",
"when've": "when have","where'd": "where did", "where've": "where have",
"who'll": "who will","who'll've": "who will have","who've": "who have",
"why've": "why have","will've": "will have","won't": "will not",
"won't've": "will not have", "would've": "would have","wouldn't": "would not",
"wouldn't've": "would not have","y'all": "you all", "y'all'd": "you all would",
"y'all'd've": "you all would have","y'all're": "you all are",
"y'all've": "you all have", "you'd": "you would","you'd've": "you would have",
"you'll": "you will","you'll've": "you will have", "you're": "you are",
"you've": "you have"}

# Regular expression for finding contractions
contractions_re=re.compile('(%s)' % '|'.join(contractions_dict.keys()))

# Function for expanding contractions
def expand_contractions(text, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, text)

# Expanding Contractions in the reviews
df['reviews.text']=df['reviews.text'].apply(lambda x:expand_contractions(x))
```
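A minimal, self-contained sketch of the same contraction-expansion approach, using a trimmed-down dictionary (the sample sentence is made up):

```python
import re

# Trimmed-down contraction dictionary for illustration
contractions_dict = {"can't": "cannot", "I'm": "I am", "won't": "will not"}

# Build one alternation pattern from the dictionary keys
contractions_re = re.compile('(%s)' % '|'.join(map(re.escape, contractions_dict.keys())))

def expand_contractions(text, contractions_dict=contractions_dict):
    # Each regex match is replaced by its dictionary expansion
    return contractions_re.sub(lambda m: contractions_dict[m.group(0)], text)

print(expand_contractions("I'm sure it can't fail"))  # → I am sure it cannot fail
```

Compiling the pattern once and letting `re.sub` call back into the dictionary handles every contraction in a single pass over the text.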

A beginner's tutorial on regular expressions in Python:

https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/

https://www.analyticsvidhya.com/blog/2017/03/extracting-information-from-reports-using-regular-expressons-library-in-python/

https://www.analyticsvidhya.com/blog/2020/01/4-applications-of-regular-expressions-that-every-data-scientist-should-know-with-python-code/

`df['cleaned']=df['reviews.text'].apply(lambda x: x.lower())`

`df['cleaned']=df['cleaned'].apply(lambda x: re.sub(r'\w*\d\w*','', x))`

`df['cleaned']=df['cleaned'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))`

```
# Removing extra spaces
df['cleaned']=df['cleaned'].apply(lambda x: re.sub(' +', ' ', x))
```
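The four cleaning steps above can be collected into one small helper. This is a sketch; the sample review is made up:

```python
import re
import string

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove digits and words containing digits
    text = re.sub(r'\w*\d\w*', '', text)
    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Collapse extra spaces
    text = re.sub(' +', ' ', text)
    return text.strip()

print(clean_text("Great Tablet!! Bought 2 for my kids..."))
# → great tablet bought for my kids
```

Applying one function per review (`df['cleaned'] = df['reviews.text'].apply(clean_text)`) keeps the cleaning order explicit and easy to test.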

```
for index, text in enumerate(df['cleaned'][35:40]):
    print('Review %d:\n' % (index + 1), text)
```

• Removing stopwords;
• Lemmatization;
• Creating the document-term matrix.

NLP essentials: removing stopwords and normalizing text with NLTK and spaCy in Python:

https://www.analyticsvidhya.com/blog/2019/08/how-to-remove-stopwords-text-normalization-nltk-spacy-gensim-python/

```
# Importing spaCy and loading its English model
import spacy
nlp = spacy.load('en_core_web_sm')

# Lemmatization with stopword removal
df['lemmatized'] = df['cleaned'].apply(lambda x: ' '.join([token.lemma_ for token in nlp(x) if not token.is_stop]))
```

```
df_grouped = df[['name', 'lemmatized']].groupby(by='name').agg(lambda x: ' '.join(x))
```
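On a toy frame (names and texts are illustrative), the groupby-and-join step behaves like this:

```python
import pandas as pd

# Toy frame mirroring the structure after lemmatization
df_toy = pd.DataFrame({'name': ['A', 'A', 'B'],
                       'lemmatized': ['good tablet', 'easy use', 'kid love']})

# Concatenate all reviews of a product into one document per product
grouped = df_toy.groupby('name').agg(lambda x: ' '.join(x))
print(grouped.loc['A', 'lemmatized'])  # → good tablet easy use
```

Each product ends up as a single long "document", which is exactly the shape the document-term matrix step below expects.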

https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf

```
# Creating Document Term Matrix
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word')
data = cv.fit_transform(df_grouped['lemmatized'])
# Use get_feature_names_out() on scikit-learn >= 1.0 (get_feature_names() on older versions)
df_dtm = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())
df_dtm.index = df_grouped.index
```

```
# Importing wordcloud for plotting word clouds and textwrap for wrapping longer text
from wordcloud import WordCloud
from textwrap import wrap

# Function for generating word clouds
def generate_wordcloud(data, title):
    wc = WordCloud(width=400, height=330, max_words=150, colormap="Dark2").generate_from_frequencies(data)
    plt.figure(figsize=(10, 8))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.title('\n'.join(wrap(title, 60)), fontsize=13)
    plt.show()

# Transposing document term matrix so each column is one product
df_dtm = df_dtm.transpose()

# Plotting word cloud for each product
for index, product in enumerate(df_dtm.columns):
    generate_wordcloud(df_dtm[product].sort_values(ascending=False), product)
```

https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/

```
from textblob import TextBlob
df['polarity'] = df['lemmatized'].apply(lambda x: TextBlob(x).sentiment.polarity)
```

```
print("3 Reviews with Highest Polarity:")
for index, review in enumerate(df.iloc[df['polarity'].sort_values(ascending=False)[:3].index]['reviews.text']):
    print('Review {}:\n'.format(index + 1), review)
```

```
print("3 Reviews with Lowest Polarity:")
for index, review in enumerate(df.iloc[df['polarity'].sort_values(ascending=True)[:3].index]['reviews.text']):
    print('Review {}:\n'.format(index + 1), review)
```

```
product_polarity_sorted = pd.DataFrame(df.groupby('name')['polarity'].mean().sort_values(ascending=True))

plt.figure(figsize=(16, 8))
plt.xlabel('Polarity')
plt.ylabel('Products')
plt.title('Polarity of Different Amazon Product Reviews')
polarity_graph = plt.barh(np.arange(len(product_polarity_sorted.index)), product_polarity_sorted['polarity'], color='purple')

# Writing product names on bars (vertically centered on each bar)
for bar, product in zip(polarity_graph, product_polarity_sorted.index):
    plt.text(0.005, bar.get_y() + bar.get_height() / 2, '{}'.format(product), va='center', fontsize=11, color='white')

# Writing polarity values next to bars
for bar, polarity in zip(polarity_graph, product_polarity_sorted['polarity']):
    plt.text(bar.get_width() + 0.001, bar.get_y() + bar.get_height() / 2, '%.3f' % polarity, va='center', fontsize=11, color='black')

plt.yticks([])
plt.show()
```

```
recommend_percentage = pd.DataFrame(((df.groupby('name')['reviews.doRecommend'].sum() * 100) / df.groupby('name')['reviews.doRecommend'].count()).sort_values(ascending=True))

plt.figure(figsize=(16, 8))
plt.xlabel('Recommend Percentage')
plt.ylabel('Products')
plt.title('Percentage of reviewers who recommended a product')
recommend_graph = plt.barh(np.arange(len(recommend_percentage.index)), recommend_percentage['reviews.doRecommend'], color='green')

# Writing product names on bars (vertically centered on each bar)
for bar, product in zip(recommend_graph, recommend_percentage.index):
    plt.text(0.5, bar.get_y() + bar.get_height() / 2, '{}'.format(product), va='center', fontsize=11, color='white')

# Writing recommendation percentages next to bars
for bar, percentage in zip(recommend_graph, recommend_percentage['reviews.doRecommend']):
    plt.text(bar.get_width() + 0.5, bar.get_y() + bar.get_height() / 2, '%.2f' % percentage, va='center', fontsize=11, color='black')

plt.yticks([])
plt.show()
```

```
import textstat

# Readability scores (the Dale-Chall column must be computed before it is averaged below)
df['dale_chall_score'] = df['reviews.text'].apply(lambda x: textstat.dale_chall_readability_score(x))
df['gunning_fog'] = df['reviews.text'].apply(lambda x: textstat.gunning_fog(x))

print('Dale Chall Score of upvoted reviews=>', df[df['reviews.numHelpful'] > 1]['dale_chall_score'].mean())
print('Dale Chall Score of not upvoted reviews=>', df[df['reviews.numHelpful'] <= 1]['dale_chall_score'].mean())

print('Gunning Fog Index of upvoted reviews=>', df[df['reviews.numHelpful'] > 1]['gunning_fog'].mean())
print('Gunning Fog Index of not upvoted reviews=>', df[df['reviews.numHelpful'] <= 1]['gunning_fog'].mean())
```
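For intuition, the Gunning Fog index is roughly 0.4 × (average sentence length + percentage of "complex" words). A simplified sketch follows; textstat's real implementation counts syllables much more carefully, so its numbers will differ:

```python
import re

def gunning_fog_sketch(text):
    """Rough Gunning Fog: 0.4 * (avg sentence length + % complex words).
    A word is treated as 'complex' if it has 3+ approximate vowel groups
    (a crude stand-in for counting 3+ syllables)."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text)
    complex_words = [w for w in words
                     if len(re.findall(r'[aeiouy]+', w.lower())) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(gunning_fog_sketch("The cat sat."))  # → 1.2 (3 words, 1 sentence, 0 complex words)
```

Short sentences of short words score low (easy to read); long sentences full of polysyllabic words push the index toward college reading levels.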

```
df['text_standard'] = df['reviews.text'].apply(lambda x: textstat.text_standard(x))

print('Text Standard of upvoted reviews=>', df[df['reviews.numHelpful'] > 1]['text_standard'].mode())
print('Text Standard of not upvoted reviews=>', df[df['reviews.numHelpful'] <= 1]['text_standard'].mode())
```

```
df['reading_time'] = df['reviews.text'].apply(lambda x: textstat.reading_time(x))

print('Reading Time of upvoted reviews=>', df[df['reviews.numHelpful'] > 1]['reading_time'].mean())
print('Reading Time of not upvoted reviews=>', df[df['reviews.numHelpful'] <= 1]['reading_time'].mean())
```

• Customers love Amazon's products: they find them satisfying and easy to use;
• Amazon needs to improve the Fire Kids Edition Tablet, since it has the most negative reviews and is also the least recommended product;
• Most reviews are written in simple English, easily understood by anyone at a fifth- or sixth-grade reading level;
• The reading time of helpful reviews is twice that of non-helpful ones, which suggests people find longer reviews more helpful.

https://www.analyticsvidhya.com/blog/category/nlp/


https://www.analyticsvidhya.com/blog/2020/01/learning-path-nlp-2020/

https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/

An essential guide to pretrained word embeddings for NLP practitioners:

https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp/

https://www.analyticsvidhya.com/blog/2019/08/complete-list-important-frameworks-nlp/

Original post: A Beginner's Guide to Exploratory Data Analysis (EDA) on Text Data (Amazon Case Study)

https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/
