# 记一次美妙的数据分析之旅~

1 `创建DataFrame`，转换长数据为宽数据；2 `导入数据`；3 `处理组合值`；4 `索引列`；5 `连接两个表`；6 `按列筛选`

7 按照字段`分组`；8 按照字段`排序`；9 分组后使用`聚合函数`；10 绘制频率`分布直方图`绘制；11 `最小抽样量`的计算方法；12 `数据去重`；13 `结果分析`

```import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pyecharts.charts import Bar,Grid,Line
import pyecharts.options as opts
from pyecharts.globals import ThemeType```

#### 1 创建DataFrame

pandas中一个dataFrame实例：

```Out[89]:
a  val
0  apple1  1.0
1  apple2  2.0
2  apple3  3.0
3  apple4  4.0
4  apple5  5.0```

```a  apple1  apple2  apple3  apple4  apple5
0     1.0     2.0     3.0     4.0     5.0```

```In [113]: pd.DataFrame(index=[0],columns=df.a,data=dict(zip(df.a,df.val)))
Out[113]:
a  apple1  apple2  apple3  apple4  apple5
0     1.0     2.0     3.0     4.0     5.0```

```In [116]: dict(zip(df.a,df.val))
Out[116]: {'apple1': 1.0, 'apple2': 2.0, 'apple3': 3.0, 'apple4': 4.0, 'apple5': 5.0}```

#### 2 导入数据

1. movies.dat
2. ratings.dat
3. users.dat

`movies.dat`包括三个字段：['Movie ID', 'Movie Title', 'Genre']

```import pandas as pd

movies = pd.read_csv('./data/movietweetings/movies.dat', delimiter='::', engine='python', header=None, names = ['Movie ID', 'Movie Title', 'Genre'])```

```   Movie ID                                        Movie Title  \
0         8      Edison Kinetoscopic Record of a Sneeze (1894)
1        10                La sortie des usines Lumi猫re (1895)
2        12                      The Arrival of a Train (1896)
3        25  The Oxford and Cambridge University Boat Race ...
4        91                         Le manoir du diable (1896)
5       131                           Une nuit terrible (1896)
6       417                      Le voyage dans la lune (1902)
7       439                     The Great Train Robbery (1903)
8       443        Hiawatha, the Messiah of the Ojibway (1903)
9       628                    The Adventures of Dollie (1908)
Genre
0                             Documentary|Short
1                             Documentary|Short
2                             Documentary|Short
3                                           NaN
4                                  Short|Horror
5                           Short|Comedy|Horror
7                    Short|Action|Crime|Western
8                                           NaN
9                                  Action|Short```

`users.dat`:

```users = pd.read_csv('./data/movietweetings/users.dat', delimiter='::', engine='python', header=None, names = ['User ID', 'Twitter ID'])

```   User ID  Twitter ID
0        1   397291295
1        2    40501255
2        3   417333257
3        4   138805259
4        5  2452094989
5        6   391774225
6        7    47317010
7        8    84541461
8        9  2445803544
9       10   995885060```

`rating.data`:

```ratings = pd.read_csv('./data/movietweetings/ratings.dat', delimiter='::', engine='python', header=None, names = ['User ID', 'Movie ID', 'Rating', 'Rating Timestamp'])

```   User ID  Movie ID  Rating  Rating Timestamp
0        1    111161      10        1373234211
1        1    117060       7        1373415231
2        1    120755       6        1373424360
3        1    317919       6        1373495763
4        1    454876      10        1373621125
5        1    790724       8        1374641320
6        1    882977       8        1372898763
7        1   1229238       9        1373506523
8        1   1288558       5        1373154354
9        1   1300854       8        1377165712```

Kaggle电影数据集第一节，我们使用数据处理利器 `pandas`， 函数`read_csv` 导入给定的三个数据文件。

```import pandas as pd

movies = pd.read_csv('./data/movietweetings/movies.dat', delimiter='::', engine='python', header=None, names = ['Movie ID', 'Movie Title', 'Genre'])
ratings = pd.read_csv('./data/movietweetings/ratings.dat', delimiter='::', engine='python', header=None, names = ['User ID', 'Movie ID', 'Rating', 'Rating Timestamp'])```

#### 3 处理组合值

`movies`字段`Genre`表示电影的类型，可能有多个值，分隔符为`|`，取值也可能为`None`.

`mask = movies.Genre.str.contains('comedy',case=False,na=False)`

case为 False，表示对大小写不敏感；na Genre列某个单元格为`NaN`时，我们使用的充填值，此处填充为`False`

```0    False
1    False
2    False
3    False
4    False
5     True
6     True
7    False
8    False
9    False
Name: Genre, dtype: bool```

#### 4 访问某列

```comedy = movies[mask]
comdey_ids = comedy['Movie ID']```

```5      131
6      417
15    2354
18    3863
19    4099
20    4100
21    4101
22    4210
23    4395
25    4518
Name: Movie ID, dtype: int64```

1-4介绍`数据读入``处理组合值``索引数据`等, pandas中使用较多的函数，基于Kaggle真实电影影评数据集，最后得到所有`喜剧 ID`

```5      131
6      417
15    2354
18    3863
19    4099
20    4100
21    4101
22    4210
23    4395
25    4518
Name: Movie ID, dtype: int64```

#### 5 连接两个表

```   User ID  Movie ID  Rating  Rating Timestamp
0        1    111161      10        1373234211
1        1    117060       7        1373415231
2        1    120755       6        1373424360
3        1    317919       6        1373495763
4        1    454876      10        1373621125
5        1    790724       8        1374641320
6        1    882977       8        1372898763
7        1   1229238       9        1373506523
8        1   1288558       5        1373154354
9        1   1300854       8        1377165712```

pandas 中使用`join`关联两张表，连接字段是`Movie ID`，如果顺其自然这么使用`join`

`combine = ratings.join(comedy, on='Movie ID', rsuffix='2')`

```combine = ratings.join(comedy.set_index('Movie ID'), on='Movie ID')

```   User ID  Movie ID  Rating  Rating Timestamp Movie Title Genre
0        1    111161      10        1373234211         NaN   NaN
1        1    117060       7        1373415231         NaN   NaN
2        1    120755       6        1373424360         NaN   NaN
3        1    317919       6        1373495763         NaN   NaN
4        1    454876      10        1373621125         NaN   NaN
5        1    790724       8        1374641320         NaN   NaN
6        1    882977       8        1372898763         NaN   NaN
7        1   1229238       9        1373506523         NaN   NaN
8        1   1288558       5        1373154354         NaN   NaN
9        1   1300854       8        1377165712         NaN   NaN```

Genre列为`NaN`表明，这不是喜剧。需要筛选出此列不为`NaN` 的记录。

#### 6 按列筛选

pandas最方便的地方，就是向量化运算，尽可能减少了for循环的嵌套。

`mask = pd.notnull(combine['Genre'])`

```result = combine[mask]

```    User ID  Movie ID  Rating  Rating Timestamp             Movie Title  \
12        1   1588173       9        1372821281      Warm Bodies (2013)
13        1   1711425       3        1372604878        21 & Over (2013)
14        1   2024432       8        1372703553   Identity Thief (2013)
17        1   2101441       1        1372633473  Spring Breakers (2012)
28        2   1431045       7        1457733508         Deadpool (2016)

Genre
12           Comedy|Horror|Romance
13                          Comedy
17              Comedy|Crime|Drama

```    User ID  Movie ID  Rating  Rating Timestamp             Movie Title  \
12        1   1588173       9        1372821281      Warm Bodies (2013)
13        1   1711425       3        1372604878        21 & Over (2013)
14        1   2024432       8        1372703553   Identity Thief (2013)
17        1   2101441       1        1372633473  Spring Breakers (2012)
28        2   1431045       7        1457733508         Deadpool (2016)

Genre
12           Comedy|Horror|Romance
13                          Comedy
17              Comedy|Crime|Drama

#### 7 按照Movie ID 分组

result中会有很多观众对同一部电影的打分，所以要求得分前10的喜剧，先按照`Movie ID`分组，然后求出平均值：

`score_as_movie = result.groupby('Movie ID').mean()`

```               User ID  Rating  Rating Timestamp
Movie ID
131       34861.000000     7.0      1.540639e+09
417       34121.409091     8.5      1.458680e+09
2354       6264.000000     8.0      1.456343e+09
3863      43803.000000    10.0      1.430439e+09
4099      25084.500000     7.0      1.450323e+09```

#### 8 按照电影得分排序

```score_as_movie.sort_values(by='Rating', ascending = False,inplace=True)
score_as_movie```

```	User ID	Rating	Rating Timestamp
Movie ID
7134690	30110.0	10.0	1.524974e+09
416889	1319.0	10.0	1.543320e+09
57840	23589.0	10.0	1.396802e+09
5693562	50266.0	10.0	1.511024e+09
5074	43803.0	10.0	1.428352e+09```

#### 9 分组后使用聚合函数

```watchs = result.groupby('Movie ID').agg(['count'])
watchs2 = watchs['Rating']['count']```

`print(watchs2.head(20))`

```Movie ID
131      1
417     22
2354     1
3863     1
4099     2
4100     1
4101     1
4210     1
4395     1
4518     1
4546     2
4936     2
5074     1
5571     1
6177     1
6414     3
6684     1
6689     1
7145     1
7162     2
Name: count, dtype: int64```

`watchs2.describe()`

```count    10740.000000
mean        20.192086
std         86.251411
min          1.000000
25%          1.000000
50%          2.000000
75%          7.000000
max       1843.000000
Name: count, dtype: float64```

#### 10 频率分布直方图

```fig = plt.figure(figsize=(12,8))
histn = plt.hist(watchs2[watchs2 <=19],19,histtype='step')
plt.scatter([i+1 for i in range(len(histn[0]))],histn[0])```

`histn`元祖表示个数和对应的被分割的区间，查看`histn[0]`:

```array([4383., 1507.,  787.,  541.,  356.,  279.,  209.,  163.,  158.,
118.,  114.,   90.,  104.,   81.,   80.,   73.,   62.,   65.,
52.])```
`sum(histn[0]) # 9222`

#### 11 最小抽样量

```n3 = result.groupby('Movie ID').agg(['count','mean','std'])
n3r = n3[n3['Rating']['count']>=20]['Rating']```

```	count	mean	std
Movie ID
417	22	8.500000	1.263027
12349	68	8.485294	1.227698
15324	20	8.350000	1.039990
15864	51	8.431373	1.374844
17925	44	8.636364	1.259216```

`nmin = (1.96**2*n3r['std']**2) / ( (n3r['mean']*0.025)**2 )`

`nmin`前5行：

```Movie ID
417         135.712480
12349       128.671290
15324        95.349276
15864       163.434005
17925       130.668350```

`n3s = n3r[ n3r['count'] >= nmin ]`

```count	mean	std
Movie ID
53604	129	8.635659	1.230714
57012	207	8.449275	1.537899
70735	224	8.839286	1.190799
75686	209	8.095694	1.358885
88763	296	8.945946	1.026984
...	...	...	...
6320628	860	7.966279	1.469924
6412452	276	7.510870	1.389529
6662050	22	10.000000	0.000000
6966692	907	8.673649	1.286455
7131622	1102	7.851180	1.751500
173 rows × 3 columns```

#### 12 去重和连表

`n3s_sort = n3s.sort_values(by='mean',ascending=False)`

```	count	mean	std
Movie ID
6662050	22	10.000000	0.000000
4921860	48	10.000000	0.000000
5262972	28	10.000000	0.000000
5512872	353	9.985836	0.266123
3863552	199	9.010050	1.163372
...	...	...	...
1291150	647	6.327666	1.785968
2557490	546	6.307692	1.858434
1478839	120	6.200000	0.728761
2177771	485	6.150515	1.523922
1951261	1091	6.083410	1.736127
173 rows × 3 columns```

`n3s_drops = n3s_sort.drop_duplicates(subset=['count'])`

```	count	mean	std
Movie ID
6662050	22	10.000000	0.000000
4921860	48	10.000000	0.000000
5262972	28	10.000000	0.000000
5512872	353	9.985836	0.266123
3863552	199	9.010050	1.163372
...	...	...	...
1291150	647	6.327666	1.785968
2557490	546	6.307692	1.858434
1478839	120	6.200000	0.728761
2177771	485	6.150515	1.523922
1951261	1091	6.083410	1.736127
157 rows × 3 columns```

```ms = movies.drop_duplicates(subset=['Movie ID'])
ms = ms.set_index('Movie ID')
n3s_final = n3s_drops.join(ms,on='Movie ID')```

#### 13 结果分析

```Movie Title
Five Minutes (2017)
MSG 2 the Messenger (2015)
Avengers: Age of Ultron Parody (2015)
Be Somebody (2016)
Bajrangi Bhaijaan (2015)
Back to the Future (1985)
La vita bella (1997)
The Intouchables (2011)
The Sting (1973)
Coco (2017)
Toy Story 3 (2010)
3 Idiots (2009)
Green Book (2018)
The Apartment (1960)
P.K. (2014)
The Truman Show (1998)
Am鑼卨ie (2001)
Inside Out (2015)
Toy Story 4 (2019)
Toy Story (1995)
Finding Nemo (2003)
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
Home Alone (1990)
Zootopia (2016)
Up (2009)
Monsters, Inc. (2001)
La La Land (2016)
Relatos salvajes (2014)
En man som heter Ove (2015)
Snatch (2000)
Lock, Stock and Two Smoking Barrels (1998)
How to Train Your Dragon 2 (2014)
As Good as It Gets (1997)
Guardians of the Galaxy (2014)
The Grand Budapest Hotel (2014)
Fantastic Mr. Fox (2009)
Silver Linings Playbook (2012)
Sing Street (2016)
Annie Hall (1977)
Pride (2014)
In Bruges (2008)
Big Hero 6 (2014)
Groundhog Day (1993)
The Breakfast Club (1985)
Little Miss Sunshine (2006)
The Terminal (2004)```

```x = n3s_final['Movie Title'][:10].tolist()[::-1]
y = n3s_final['count'][:10].tolist()[::-1]
bar = (
Bar()
.reversal_axis()
.set_global_opts(title_opts=opts.TitleOpts(title="喜剧电影被评论次数"),
toolbox_opts=opts.ToolboxOpts(),)
)
grid = (
Grid(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
)
grid.render_notebook()```

```x = n3s_final['Movie Title'][:10].tolist()[::-1]
y = n3s_final['mean'][:10].round(3).tolist()[::-1]
bar = (
Bar()
.reversal_axis()
.set_global_opts(title_opts=opts.TitleOpts(title="喜剧电影平均得分"),
xaxis_opts=opts.AxisOpts(min_=8.0,name='平均得分'),
toolbox_opts=opts.ToolboxOpts(),)
)
grid = (
Grid(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
)
grid.render_notebook()```

0 条评论

• ### 深入浅出实战：说话人识别

由于语音特征的特别，我们很难在机器学习入门的文章中看到关于语音的案例或者实验，本文主要介绍说话人识别的大体流程与原理，不在具体的细节公式上做过多讨论（因为实在是...

• ### 图解用栈数据结构对树的遍历

本公众号主要推送关于如何构思算法使之应用到我们的工作中。计算机常用算法思想大致来说有，分而治之，动态规划，贪心算法，搜索算法，回溯 ，训练这些思维的一个很好的平...

• ### 均分纸牌（经典贪心）

有N堆纸牌，编号分别为1,2,…,N。每堆上有若干张，但纸牌总数必为N的倍数。可以在任一堆上取若干张纸牌，然后移动。

• ### 【干货】MySQL 分库分表及其平滑扩容方案

众所周知，数据库很容易成为应用系统的瓶颈。单机数据库的资源和处理能力有限，在高并发的分布式系统中，可采用分库分表突破单机局限。本文总结了分库分表的相关概念、全局...

• ### 网络拓扑图上文本的巧妙应用

在前端网页设计中，文本是重要的组成部分，那么在网络拓扑图中也是一样的，文本在网络拓扑图上最基本的显示功能之一，在不同的应用场景下，会有不同的需求。但是不同的需求...

• ### SpringMVC实现发送邮件

跑单元测试的时候报：Could not resolve placeholder异常，不可以解析email.protocol

最近WordPress界出了条新闻：博客平台Wordpress网站遭遇大规模暴力破解攻击（原文附后）。看到这条消息，我立马到空间后台查看了下，发现确实是有很多来...

• ### 业界 | 谷歌最新语义图像分割模型DeepLab-v3+今日开源

选自Google Research Blog 作者：Liang-Chieh Chen、Yukun Zhu 机器之心编译 参与：刘晓坤、路雪 刚刚，谷歌开源了语义...

• ### 硬件对数据库性能优化带来的影响

数据库性能优化不是一个简单的任务，不仅仅是SQL层面的优化，它的关键在于对innodb存储引擎的了解，当然，好的存储引擎性能离不开好的硬件系统的支撑，这...