[007] Keep This Python Visualization Cheat Sheet Handy!

Sam Gor

Published 2020-11-23 13:02:40

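Collected in the column: SAMshare

The author sums up commonly used demos for EDA visualization in Python and walks us through a case study. The code is reusable and covers common charts, including line charts, horizontal bar charts, column charts, stacked charts, and pie charts. Skim it now and bookmark it for later!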

Introduction

Exploratory Data Analysis (EDA) is an indispensable step in data mining. To interpret aspects of a data set such as its distribution, principal features, or interference, we need to visualize the data in different graphs or images. Fortunately, Python offers many libraries that make visualization more convenient than ever. Some of them, such as Matplotlib, Seaborn, Plotly, and Bokeh, are widely used today.

Since my job concentrates on scrutinizing data from all angles, I have been exposed to many types of graphs. However, because there are so many functions and the code is not easy to remember, I sometimes forget the syntax and have to review or search for similar code on the Internet. That has undoubtedly cost me a lot of time, hence my motivation for writing this article. Hopefully, it will be a small help to anyone who, like me, has the memory of a goldfish.

Data Description

My dataset comes from a public Kaggle dataset. It is a grocery dataset, and you can get the data from the link below:

Groceries dataset: Dataset of 38765 rows for Market Basket Analysis (www.kaggle.com)

Dataset URL: https://www.kaggle.com/heeraldedhia/groceries-dataset

This grocery data consists of 3 columns, which are:

Member_number: ID number of the customer
Date: purchase date (day-month-year)
itemDescription: item name

Import necessary packages

There are some packages that we should import first.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Jupyter/IPython magic: renders figures inline (remove when running as a plain script)
%matplotlib inline
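
The article does not show how the data frame df is created. The following is a minimal sketch of that missing step, assuming the Kaggle download is a CSV with the three columns listed earlier. The helper name load_groceries, the toy rows, and the file name in the last comment are assumptions made for illustration, not part of the original article.

```python
import io
import pandas as pd

# Columns described in the article
EXPECTED_COLUMNS = ['Member_number', 'Date', 'itemDescription']

def load_groceries(source):
    """Read the grocery CSV and check that the expected columns are present."""
    frame = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    return frame

# Toy rows in the same layout, standing in for the real file (illustrative values only)
sample_csv = io.StringIO(
    'Member_number,Date,itemDescription\n'
    '1001,21-07-2015,whole milk\n'
    '1002,05-01-2015,rolls/buns\n'
    '1001,01-01-2014,whole milk\n'
)
df_sample = load_groceries(sample_csv)
print(df_sample.shape)

# With the real download, the call would look like this (file name is an assumption):
# df = load_groceries('Groceries_dataset.csv')
```

Checking the columns up front makes a renamed or missing column fail early instead of surfacing later as a KeyError inside a groupby.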

Visualize data

Line Chart

For this section, I will use a line chart to visualize the grocery store's activity, measured as unique customers per month, over the two years 2014 and 2015.

First, I will transform the data frame a bit to get the unique customers counted by month and year.

#Parse the dates once; the Date strings are day-first (dd-mm-yyyy)
dates = pd.to_datetime(df['Date'], format='%d-%m-%Y')
#Get the month and year
df['year'] = dates.dt.year
df['month'] = dates.dt.month
df['month_year'] = dates.dt.to_period('M')
#Count unique customers by month-year
group_by_month = df.groupby('month_year').agg({'Member_number':'nunique'}).reset_index()
#Sort the month-year by time order
group_by_month = group_by_month.sort_values(by = ['month_year'])
group_by_month['month_year'] = group_by_month['month_year'].astype('str')
group_by_month.head()
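
The Date strings in this dataset are day-first (the swarm plot section parses them with '%d-%m-%Y'), so it is worth being explicit about the format when deriving month and year. The sketch below, on a few illustrative dates, shows how an explicit format keeps 05-01-2015 as 5 January rather than 1 May.

```python
import pandas as pd

# Illustrative dates in the dataset's day-month-year layout
dates = pd.Series(['05-01-2015', '21-07-2015', '01-01-2014'])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(dates, format='%d-%m-%Y')
month_year = parsed.dt.to_period('M').astype(str)

print(parsed.dt.month.tolist())   # months come from the middle field
print(month_year.tolist())
```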

After we have our data, let’s try to visualize it:

#Create subplot
sns.set_style('whitegrid')
fig,ax=plt.subplots(figsize=(16,7))
#Create lineplot
chart=sns.lineplot(x=group_by_month['month_year'], y=group_by_month['Member_number'],ax=ax)
sns.despine(left=True)
#Customize chart
chart.set_xlabel('Period',weight='bold',fontsize=13)
chart.set_ylabel('Total Unique Customer', weight='bold',fontsize=13)
chart.set_title('Monthly Unique Customers',weight='bold',fontsize=16)
chart.set_xticklabels(group_by_month['month_year'], rotation = 45, ha="right")

ymin, ymax = ax.get_ylim()
bonus = (ymax - ymin)/28# still hard coded bonus but scales with the data
for x, y, name in zip(group_by_month['month_year'], group_by_month['Member_number'], group_by_month['Member_number'].astype('str')):
    ax.text(x, y + bonus, name, color = 'black', ha='center')

Bar Chart

A bar chart is used to show how objects change over time or to compare values across objects. Bar charts usually have two axes: one axis lists the objects or categories being analyzed, and the other shows their measured values.

For this dataset, I will use a bar chart to visualize the 10 best-selling categories in 2014 and 2015. You can display it as either a horizontal or a vertical bar chart. Let’s see how it looks.

Data Transformation

#Count and group by category
category = df.groupby('itemDescription').agg({'Member_number':'count'}).rename(columns={'Member_number':'total sale'}).reset_index()
#Keep the 10 best-selling categories
category2 = category.sort_values(by=['total sale'], ascending = False).head(10)
category2.head()
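
The same top-N table can also be built with value_counts, which counts and sorts in one step. This is only an alternative sketch on toy rows (the values are illustrative), not the article's own transformation.

```python
import pandas as pd

# Toy transactions (illustrative values only)
toy = pd.DataFrame({
    'Member_number': [1, 2, 3, 4, 5, 6],
    'itemDescription': ['whole milk', 'soda', 'whole milk', 'yogurt', 'soda', 'whole milk'],
})

# Count rows per item; value_counts sorts in descending order by default
top_items = (
    toy['itemDescription']
    .value_counts()
    .head(2)
    .rename_axis('itemDescription')
    .reset_index(name='total sale')
)
print(top_items)
```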

Horizontal Bar Chart

#Horizontal barchart
#Create subplot
sns.set_style('whitegrid') #set theme
fig,ax=plt.subplots(figsize=(16,7))
#Create barplot
chart2 = sns.barplot(x=category2['total sale'],y=category2['itemDescription'], palette=sns.cubehelix_palette(len(category2)))
#Customize chart
chart2.set_xlabel('Total Sale',weight='bold',fontsize=13)
chart2.set_ylabel('Item Name', weight='bold',fontsize=13)
chart2.set_title('Best Sellers',weight='bold',fontsize=16)
#Value number on chart: https://stackoverflow.com/questions/49820549/labeling-horizontal-barplot-with-values-in-seaborn
for p in ax.patches:
    width = p.get_width()    # get bar length
    ax.text(width + 1,       # set the text at 1 unit right of the bar
            p.get_y() + p.get_height() / 2, # Y coordinate of the bar + half its height (vertical center)
            '{:1.0f}'.format(width), # set variable to display
            ha = 'left',   # horizontal alignment
            va = 'center')  # vertical alignment

If you prefer a vertical bar chart, try this:

#Vertical Barchart
#Create subplot
sns.set_style('whitegrid') #set theme
fig,ax=plt.subplots(figsize=(16,7))
#Create barplot
chart2 = sns.barplot(x=category2['itemDescription'],y=category2['total sale'], palette=sns.cubehelix_palette(len(category2)))
#Customize chart
chart2.set_ylabel('Total Sale',weight='bold',fontsize=13)
chart2.set_xlabel('Item Name', weight='bold',fontsize=13)
chart2.set_title('Best Sellers',weight='bold',fontsize=16)
sns.despine()
#Value number on chart
for p in ax.patches: 
    height =p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
        height + 3,
        '{:1.0f}'.format(height),
        ha="center", fontsize=12)
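
As an alternative to looping over ax.patches, matplotlib 3.4 and later provide Axes.bar_label, which writes one label per bar. The sketch below uses plain matplotlib and toy totals (illustrative values only). With a seaborn bar plot, the bar containers should typically be reachable through ax.containers, but treat that as an assumption to verify against your installed version.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt

# Toy totals (illustrative values only)
items = ['whole milk', 'soda', 'yogurt']
totals = [30, 20, 10]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(items, totals)
# bar_label places one text label per bar; padding is the gap in points
value_labels = ax.bar_label(bars, fmt='%d', padding=3)
ax.set_ylabel('Total Sale')
ax.set_title('Best Sellers')
```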

Bar Chart with Hue Value

If you want to compare each month’s sales by year, what would your visualization look like? You can draw the graph with an additional element called the hue value.

#transform dataset
month_sale = df.groupby(['month','year']).agg({'Member_number':'count'}).rename(columns={'Member_number':'Sales'}).reset_index()
#Create subplot
sns.set_style('whitegrid') #set theme
fig,ax=plt.subplots(figsize=(20,10))
#Create barplot
chart3 = sns.barplot(data=month_sale, x='month',y='Sales', hue ='year', palette = 'Paired')
#Customize chart
chart3.set_ylabel('Sales',weight='bold',fontsize=13)
chart3.set_xlabel('Month', weight='bold',fontsize=13)
chart3.set_title('Monthly Sales by Year',weight='bold',fontsize=16)
chart3.legend(loc='upper right', fontsize =16)
sns.despine(left = False)

#Create value label on each bar
for p in ax.patches:
    height = p.get_height()
    if height > 0:   # skip empty placeholder patches
        ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.0f}'.format(height),   # show the bar's own value
            ha="center", fontsize=12)
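
To read exact numbers rather than bar heights, the long table behind a hue plot can be pivoted so that each year becomes a column. A minimal sketch on toy monthly totals (illustrative values only; the name month_sale_toy is hypothetical):

```python
import pandas as pd

# Toy monthly totals in the same long layout as month_sale (illustrative values only)
month_sale_toy = pd.DataFrame({
    'month': [1, 1, 2, 2],
    'year': [2014, 2015, 2014, 2015],
    'Sales': [100, 120, 90, 110],
})

# One row per month, one column per year: the table the grouped bars encode
wide = month_sale_toy.pivot(index='month', columns='year', values='Sales')
wide['change'] = wide[2015] - wide[2014]
print(wide)
```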

Now, can you see it more clearly?

Histogram

Imagine that I want to find out how often customers buy whole milk, the best-selling category. I will use a histogram to obtain this information.

#Data transformation
whole_milk = df[['Member_number','itemDescription','year']][df['itemDescription'] == 'whole milk'].groupby(['Member_number','year']).agg({'itemDescription':'count'}).reset_index()
#Create subplot
sns.set_style('whitegrid') #set theme
fig,ax=plt.subplots(figsize =(15,7))
#Create histograms (sns.histplot replaces the deprecated sns.distplot; needs seaborn 0.11+)
a = whole_milk[whole_milk.year == 2014]
sns.histplot(a['itemDescription'], bins=np.arange(1,10), label='2014', color='black', alpha=0.4, ax=ax)
a = whole_milk[whole_milk.year == 2015]
sns.histplot(a['itemDescription'], bins=np.arange(1,10), label='2015', color='green', alpha=0.4, ax=ax)
#Plot formatting
ax.set_xticks(range(1,10))
plt.legend(prop={'size': 12})
plt.title('Frequency of Whole Milk Purchase', weight='bold',fontsize = 16)
plt.xlabel('Number of Purchases',weight = 'bold',fontsize = 13)
plt.ylabel('Number of Customers',weight = 'bold',fontsize = 13)

By looking at the visualization, we can see that customers hardly repurchase this item more than twice, and a lot of customers cease to buy this product after their first purchases.
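
A reading like this can be cross-checked with a small frequency table: count purchases per customer and year, then count how many customers fall at each purchase count. The sketch below runs on toy rows (illustrative values only), so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy purchases (illustrative values only)
toy = pd.DataFrame({
    'Member_number': [1, 1, 2, 3, 3, 3, 4],
    'year': [2014, 2014, 2014, 2014, 2014, 2014, 2015],
    'itemDescription': ['whole milk'] * 7,
})

# Purchases per customer per year (the quantity the histogram bins)
per_customer = toy.groupby(['Member_number', 'year']).size().rename('purchases').reset_index()

# Number of customers at each purchase count, per year
freq = per_customer.groupby(['year', 'purchases']).size().unstack(fill_value=0)
print(freq)
```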

Pie chart

Actually, pie charts are quite poor at communicating the data. However, it does not hurt to learn this visualization technique.

For this data, I want to compare the sales of the top 10 categories with the rest in both 2014 and 2015. Now, let’s transform our data to get this information visualized.

#Get the list of top 10 categories (avoid naming it "list", which would shadow the built-in)
top10_items = category2.itemDescription.to_list()
#Get top 10 vs the rest
category_by_year = df.groupby(['itemDescription','year']).agg({'Member_number':'count'}).rename(columns={'Member_number':'total sale'}).reset_index()
category_by_year['classification']= np.where(category_by_year.itemDescription.isin(top10_items),'Top 10', 'Not Top 10')
top10_vs_therest = category_by_year.groupby(['classification','year']).agg({'total sale':'sum'}).reset_index()
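
The percentages a pie chart prints can also be computed directly, which makes year-to-year comparisons easier to state. A minimal sketch on toy totals laid out like top10_vs_therest (illustrative values only; the column name share_pct is hypothetical):

```python
import pandas as pd

# Toy totals in the same layout as top10_vs_therest (illustrative values only)
totals = pd.DataFrame({
    'classification': ['Not Top 10', 'Not Top 10', 'Top 10', 'Top 10'],
    'year': [2014, 2015, 2014, 2015],
    'total sale': [600, 650, 400, 350],
})

# Share of each class within its year, in percent
year_total = totals.groupby('year')['total sale'].transform('sum')
totals['share_pct'] = (100 * totals['total sale'] / year_total).round(1)
print(totals)
```

Here transform('sum') returns one total per row, aligned with the original rows, so the division needs no merge.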

Our data is now ready. Let’s see the pies!

#Create subplot
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(16,10))
#Select a color for each section of the pie
colors = ['lightskyblue', 'lightcoral']
#Plot the 1st pie
labels = top10_vs_therest[(top10_vs_therest.year == 2014)]['classification']
explode_list = [0.1, 0]
values = top10_vs_therest[(top10_vs_therest.year == 2014)]['total sale']
ax1.pie(values,labels = labels,colors = colors, explode = explode_list,autopct = '%1.1f%%') 
ax1.set_title('Sales of Top 10 Categories in 2014', fontsize = 13, weight = 'bold')
ax1.legend(prop={'size': 12}, loc = 'upper right')

#Plot the 2nd pie
labels2 = top10_vs_therest[(top10_vs_therest.year == 2015)]['classification']
values2 = top10_vs_therest[(top10_vs_therest.year == 2015)]['total sale']
ax2.pie(values2,labels = labels2,colors = colors, explode = explode_list, autopct = '%1.1f%%') 
ax2.set_title('Sales of Top 10 Categories in 2015', fontsize = 13, weight = 'bold')
ax2.legend(prop={'size': 12})

# plt.title('Top 10 Sales in 2014 vs 2015', weight = 'bold', fontsize = 15)
plt.show()

So the top 10 categories made up a smaller share of purchases in 2015 than in 2014, down by 5.5%.

Swarm Plot

Another way to review your data is a swarm plot. In a swarm plot, points are adjusted (only along the categorical axis) so that they do not overlap. This is helpful because it complements a box plot when you want to display all observations along with some representation of the underlying distribution.

As I want to see the number of items sold in each day of the week, I may use this type of chart to display the information. As usual, let’s first calculate the items sold and group them by categories and days.

#Extract day of week from date time
import datetime
import calendar

def findDay(date):
    #Date strings are day-first (dd-mm-yyyy); weekday() returns 0 for Monday
    weekday = datetime.datetime.strptime(date, '%d-%m-%Y').weekday()
    return calendar.day_name[weekday]

df['day'] = df['Date'].apply(findDay)
#Group the data by day
df_day = df1.groupby(['day','itemDescription']).agg({'Member_number':'count'}).rename(columns={'Member_number':'Sales'}).reset_index()
dows = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df_day['day'] = pd.Categorical(df_day['day'], categories=dows, ordered=True)
df_day2 = df_day.sort_values('day')
df_day2.head()
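
The weekday extraction can also be done without a helper function by using pandas' datetime accessor. This is an alternative sketch on illustrative dates, assuming the same day-month-year format as the code above.

```python
import pandas as pd

# Illustrative dates in the dataset's day-month-year layout
dates = pd.Series(['21-07-2015', '05-01-2015', '01-01-2014'])

# Vectorized weekday names, no Python-level loop
day_names = pd.to_datetime(dates, format='%d-%m-%Y').dt.day_name()

# Ordered categorical so that sorting follows the calendar week, not the alphabet
dows = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_cat = pd.Series(pd.Categorical(day_names, categories=dows, ordered=True))

print(day_names.tolist())
print(day_cat.sort_values().tolist())
```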

After we obtain the data, let’s see what the graph looks like.

sns.set_style('dark') #set theme
#Create subplot
fig,ax = plt.subplots(figsize=(16,10))
#Plot the swarm
chart5 = sns.swarmplot(x="day", y="Sales", data=df_day2)
#Label
chart5.set_ylabel('Sales',weight='bold',fontsize=13)
chart5.set_xlabel('Day of Week', weight='bold',fontsize=13)
chart5.set_title('Sales by Day of Week',weight='bold',fontsize=16)

Conclusion

In this article, I have shown you how to visualize your data with different types of charts. If you find it helpful, you can save it and review it anytime you want. It can save you tons of time down the road. :D

Originally published: 2020-11-19. In case of infringement, please contact cloudcommunity@tencent.com for removal.