Hands-on: Making Sense of a Million Movie Ratings

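Preparation

The movie rating data analyzed here is the MovieLens 1M Dataset, available for download from grouplens.org/datasets/movielens.

Read the documentation that accompanies the dataset as well: for convenience, some variables are stored as codes, and their actual meanings are explained there.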

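Reading the data

The column names below follow those of the original dataset (see the dataset documentation for what each one means). Replace the file paths with the location where you stored the files. The session assumes pandas has already been imported as pd.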

User data

In [7]: user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']

In [8]: users = pd.read_table('F:/Anaconda个人文件/Jupyter/MaiZi_data/ml-1m/users.dat',
   ...:                       sep='::', header=None, names=user_names, engine='python')

Rating data

In [9]: rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']

In [10]: ratings = pd.read_table('F:/Anaconda个人文件/Jupyter/MaiZi_data/ml-1m/ratings.dat',
    ...:                         sep='::', header=None, names=rating_names, engine='python')

Movie data

In [11]: movie_names = ['movie_id', 'title', 'genres']

In [12]: movies = pd.read_table('F:/Anaconda个人文件/Jupyter/MaiZi_data/ml-1m/movies.dat',
    ...:                        sep='::', header=None, names=movie_names, engine='python')
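The three reads above share one pattern, which can be tried without the downloaded files. The sketch below is only an illustration: it parses two in-memory lines in the users.dat layout (the values are the two rows shown by users.head(2) in this article). The name users_demo and the choice to keep zip as text are assumptions made here, not part of the original session.

```python
import io
import pandas as pd

# Two sample lines in the users.dat layout (values from users.head(2)).
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"

user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A two-character separator such as '::' is not supported by the default
# C parser, so engine='python' is requested explicitly.
users_demo = pd.read_csv(
    io.StringIO(sample),
    sep='::',
    header=None,
    names=user_names,
    engine='python',
    dtype={'zip': str},  # assumption: keep ZIP codes as text, not integers
)

print(users_demo)
print(users_demo.dtypes)
```

Reading zip as a string is a design choice: ZIP codes are labels rather than quantities, and parsing them as integers would drop any leading zeros.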

A quick look at the row counts and first rows of the three tables

In [15]: print(len(users))

6040

In [16]: users.head(2)

Out[16]:

user_id gender age occupation zip

0 1 F 1 10 48067

1 2 M 56 16 70072

In [17]: print(len(ratings))

1000209

In [18]: ratings.head(2)

Out[18]:

user_id movie_id rating timestamp

In [19]: print(len(movies))

3883

In [20]: movies.head(2)

Out[20]:

movie_id title genres

0 1 Toy Story (1995) Animation|Children's|Comedy

1 2 Jumanji (1995) Adventure|Children's|Fantasy

Merging the data

In [13]: data = pd.merge(pd.merge(users,ratings),movies)

In [14]: len(data)

Out[14]: 1000209

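The merged table has 1,000,209 rows, the same as ratings. This is not because the merge takes the largest of the three row counts: pd.merge performs an inner join on the shared column names (user_id for the first merge, movie_id for the second), and every rating matches exactly one user and one movie, so each rating produces exactly one merged row. In other words, 6,040 users submitted 1,000,209 ratings for movies in a catalog of 3,883 titles, since each user can watch and rate many movies. A movie that nobody rated does not appear in the merged table at all.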
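The row-count behavior of the merge can be checked on a toy example. The sketch below uses made-up tables (users_demo, ratings_demo, movies_demo are hypothetical names and values) and spells out the join keys. The validate argument asks pandas to raise an error if the keys are not unique on the side where they are expected to be.

```python
import pandas as pd

# Toy tables (made-up values) with the same key columns as the real data.
users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_demo = pd.DataFrame({'movie_id': [10, 20, 30],
                            'title': ['A (1990)', 'B (1995)', 'C (2000)']})
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2, 2],
                             'movie_id': [10, 20, 10, 20],
                             'rating': [5, 3, 4, 2]})

# Name the join keys explicitly and let pandas check the key relationships.
merged = (users_demo
          .merge(ratings_demo, on='user_id', validate='one_to_many')
          .merge(movies_demo, on='movie_id', validate='many_to_one'))

print(len(merged))                # one row per rating
print(merged['title'].nunique())  # movie 30 has no ratings, so it is absent
```

Here the merged table has as many rows as ratings_demo, and the unrated movie drops out of the inner join.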

Average rating of each movie, by gender

In [23]: mean_ratings_gender = data.pivot_table(values='rating', index='title',
    ...:                                        columns='gender', aggfunc='mean')

In [24]: mean_ratings_gender.head(5)

Out[24]:

gender F M

title

$1,000,000 Duck (1971) 3.375000 2.761905

'Night Mother (1986) 3.388889 3.352941

'Til There Was You (1997) 2.675676 2.733333

'burbs, The (1989) 2.793478 2.962085

...And Justice for All (1979) 3.828571 3.689024
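For readers less familiar with pivot_table, the same table can be built with groupby followed by unstack. The sketch below uses a small made-up data set (demo is a hypothetical name) to show the two forms side by side. It illustrates the idea and is not the code used in the original session.

```python
import pandas as pd

# Made-up ratings; movie 'B (1995)' has no rating from women.
demo = pd.DataFrame({
    'title':  ['A (1990)', 'A (1990)', 'A (1990)', 'B (1995)'],
    'gender': ['F', 'F', 'M', 'M'],
    'rating': [5, 3, 2, 4],
})

by_pivot = demo.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# Same idea: group by both keys, average, then move gender into columns.
by_groupby = demo.groupby(['title', 'gender'])['rating'].mean().unstack('gender')

print(by_pivot)
print(by_groupby)
```

A cell is NaN when a movie has no ratings from one gender, which is worth keeping in mind before subtracting the two columns.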

Difference between women's and men's ratings of the same movie (F minus M), a rough and indirect indicator of where their tastes diverge

In [25]: mean_ratings_gender['diff'] = mean_ratings_gender.F - mean_ratings_gender.M

In [26]: mean_ratings_gender.head(2)

Out[26]:

gender F M diff

title

$1,000,000 Duck (1971) 3.375000 2.761905 0.613095

'Night Mother (1986) 3.388889 3.352941 0.035948

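The ten movies with the largest gap in men's favor (sorting diff in ascending order puts the most negative values first; an F average of exactly 1.0 suggests these titles have very few ratings from women)

In [27]: mean_ratings_gender.sort_values(by='diff', ascending=True).head(10)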

Out[27]:

gender F M diff

title

Tigrero: A Film That Was Never Made (1994) 1.0 4.333333 -3.333333

Neon Bible, The (1995) 1.0 4.000000 -3.000000

Enfer, L' (1994) 1.0 3.750000 -2.750000

Stalingrad (1993) 1.0 3.593750 -2.593750

Killer: A Journal of Murder (1995) 1.0 3.428571 -2.428571

Dangerous Ground (1997) 1.0 3.333333 -2.333333

In God's Hands (1998) 1.0 3.333333 -2.333333

Rosie (1998) 1.0 3.333333 -2.333333

Flying Saucer, The (1950) 1.0 3.300000 -2.300000

Jamaica Inn (1939) 1.0 3.142857 -2.142857
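The ranking above covers only one direction of the gap and is sensitive to movies with a handful of ratings. One possible refinement, sketched below on made-up data, is to require a minimum number of ratings from each gender and then rank by the absolute difference. The threshold min_per_gender and the name demo are assumptions introduced for illustration and are not part of the original analysis.

```python
import pandas as pd

# Made-up ratings: 'C (2000)' has a single rating from women.
demo = pd.DataFrame({
    'title':  ['A (1990)'] * 4 + ['B (1995)'] * 4 + ['C (2000)'] * 3,
    'gender': ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'M', 'M'],
    'rating': [5, 4, 2, 3, 3, 3, 4, 3, 1, 5, 4],
})

# Mean rating and number of ratings per title and gender.
stats = demo.pivot_table(values='rating', index='title',
                         columns='gender', aggfunc=['mean', 'count'])

min_per_gender = 2  # hypothetical threshold; tune it for the real data
enough = (stats['count'] >= min_per_gender).all(axis=1)

means = stats['mean'][enough].copy()
means['diff'] = means['F'] - means['M']

# Rank by the size of the gap, whichever direction it goes.
order = means['diff'].abs().sort_values(ascending=False).index
print(means.loc[order])
```

In this toy example 'C (2000)' is excluded even though its raw gap is the largest, because that gap rests on a single rating from women.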

Grouping by movie title

In [28]: ratings_by_movie_title = data.groupby('title').size()

In [29]: ratings_by_movie_title.head(2)

Out[29]:

title

$1,000,000 Duck (1971) 37

'Night Mother (1986) 70

dtype: int64

The ten most-rated movies, among those with more than 1,000 ratings

In [30]: top_ratings = ratings_by_movie_title[ratings_by_movie_title > 1000]

In [31]: top_10_ratings = top_ratings.sort_values(ascending=False).head(10)

In [32]: top_10_ratings

Out[32]:

title

American Beauty (1999) 3428

Star Wars: Episode IV - A New Hope (1977) 2991

Star Wars: Episode V - The Empire Strikes Back (1980) 2990

Star Wars: Episode VI - Return of the Jedi (1983) 2883

Jurassic Park (1993) 2672

Saving Private Ryan (1998) 2653

Terminator 2: Judgment Day (1991) 2649

Matrix, The (1999) 2590

Back to the Future (1985) 2583

Silence of the Lambs, The (1991) 2578

dtype: int64
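As a side note, the count-then-sort steps can also be written with value_counts, which returns counts already sorted from largest to smallest. The sketch below runs on a made-up title column, and threshold is a hypothetical stand-in for the 1000 used above.

```python
import pandas as pd

# Made-up data: one row per rating, only the title column is needed here.
demo = pd.DataFrame({'title': ['A (1990)', 'B (1995)', 'A (1990)',
                               'C (2000)', 'A (1990)', 'B (1995)']})

# Number of ratings per title, sorted by count, largest first.
counts = demo['title'].value_counts()

threshold = 1  # hypothetical; the article keeps titles with more than 1000 ratings
popular = counts[counts > threshold].head(10)
print(popular)
```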

The twenty highest-rated movies

In [33]: mean_ratings = data.pivot_table(values='rating', index='title', aggfunc='mean')

In [34]: top_20_mean_ratings = mean_ratings.sort_values(by='rating', ascending=False).head(20)

In [35]: top_20_mean_ratings

Out[35]:

rating

title

Ulysses (Ulisse) (1954) 5.000000

Lured (1947) 5.000000

Follow the Bitch (1998) 5.000000

Bittersweet Motel (2000) 5.000000

Song of Freedom (1936) 5.000000

One Little Indian (1973) 5.000000

Smashing Time (1967) 5.000000

Schlafes Bruder (Brother of Sleep) (1995) 5.000000

Gate of Heavenly Peace, The (1995) 5.000000

Baby, The (1973) 5.000000

I Am Cuba (Soy Cuba/Ya Kuba) (1964) 4.800000

Lamerica (1994) 4.750000

Apple, The (Sib) (1998) 4.666667

Sanjuro (1962) 4.608696

Seven Samurai (The Magnificent Seven) (Shichini... 4.560510

Shawshank Redemption, The (1994) 4.554558

Godfather, The (1972) 4.524966

Close Shave, A (1995) 4.520548

Usual Suspects, The (1995) 4.517106

Schindler's List (1993) 4.510417

Average ratings of the ten most-rated movies (each with more than 1,000 ratings)

In [36]: mean_ratings.loc[top_10_ratings.index]

Out[36]:

rating

title

American Beauty (1999) 4.317386

Star Wars: Episode IV - A New Hope (1977) 4.453694

Star Wars: Episode V - The Empire Strikes Back ... 4.292977

Star Wars: Episode VI - Return of the Jedi (1983) 4.022893

Jurassic Park (1993) 3.763847

Saving Private Ryan (1998) 4.337354

Terminator 2: Judgment Day (1991) 4.058513

Matrix, The (1999) 4.315830

Back to the Future (1985) 3.990321

Silence of the Lambs, The (1991) 4.351823

Number of ratings for the twenty highest-rated movies (a high average may simply be an artifact of very few ratings)

In [37]: ratings_by_movie_title.loc[top_20_mean_ratings.index]

Out[37]:

title

Ulysses (Ulisse) (1954) 1

Lured (1947) 1

Follow the Bitch (1998) 1

Bittersweet Motel (2000) 1

Song of Freedom (1936) 1

One Little Indian (1973) 1

Smashing Time (1967) 2

Schlafes Bruder (Brother of Sleep) (1995) 1

Gate of Heavenly Peace, The (1995) 3

Baby, The (1973) 1

I Am Cuba (Soy Cuba/Ya Kuba) (1964) 5

Lamerica (1994) 8

Apple, The (Sib) (1998) 9

Sanjuro (1962) 69

Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 628

Shawshank Redemption, The (1994) 2227

Godfather, The (1972) 2223

Close Shave, A (1995) 657

Usual Suspects, The (1995) 1783

Schindler's List (1993) 2304

dtype: int64

The ten highest-rated movies with more than 1,000 ratings

In [40]: top_10_movies = mean_ratings.loc[top_ratings.index].sort_values(by='rating', ascending=False).head(10)

In [41]: top_10_movies

Out[41]:

rating

title

Shawshank Redemption, The (1994) 4.554558

Godfather, The (1972) 4.524966

Usual Suspects, The (1995) 4.517106

Schindler's List (1993) 4.510417

Raiders of the Lost Ark (1981) 4.477725

Rear Window (1954) 4.476190

Star Wars: Episode IV - A New Hope (1977) 4.453694

Dr. Strangelove or: How I Learned to Stop Worry... 4.449890

Casablanca (1942) 4.412822

Sixth Sense, The (1999) 4.406263

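Combined view of average rating and popularity (hot, the number of ratings) for the ten highest-rated movies with more than 1,000 ratings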

In [42]: df_top_10_movies = pd.DataFrame(top_10_movies)

In [43]: df_top_10_movies['hot'] = top_ratings[top_10_movies.index]

In [44]: df_top_10_movies

Out[44]:

rating hot

title

Shawshank Redemption, The (1994) 4.554558 2227

Godfather, The (1972) 4.524966 2223

Usual Suspects, The (1995) 4.517106 1783

Schindler's List (1993) 4.510417 2304

Raiders of the Lost Ark (1981) 4.477725 2514

Rear Window (1954) 4.476190 1050

Star Wars: Episode IV - A New Hope (1977) 4.453694 2991

Dr. Strangelove or: How I Learned to Stop Worry... 4.449890 1367

Casablanca (1942) 4.412822 1669

Sixth Sense, The (1999) 4.406263 2459
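The last few steps (average rating, rating count, threshold, sort) can also be combined into a single groupby aggregation. The sketch below is one way to do it on made-up data. It reuses the column names rating and hot from the table above, while demo, summary, and threshold are hypothetical names introduced here.

```python
import pandas as pd

# Made-up ratings: 'C (2000)' has the highest average but only one rating.
demo = pd.DataFrame({
    'title':  ['A (1990)'] * 3 + ['B (1995)'] * 2 + ['C (2000)'],
    'rating': [4, 5, 3, 5, 4, 5],
})

# One pass over the groups: average rating and number of ratings per title.
summary = demo.groupby('title')['rating'].agg(rating='mean', hot='count')

threshold = 1  # hypothetical; the article keeps titles with more than 1000 ratings
top = (summary[summary['hot'] > threshold]
       .sort_values('rating', ascending=False)
       .head(10))
print(top)
```

Keeping the mean and the count in the same table makes it easy to see at a glance whether a high average is backed by enough ratings.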

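  • Original link: http://kuaibao.qq.com/s/20180323G1W6Q800?refer=cp_1026
  • Tencent Cloud+ Community is one of the distribution channels for Tencent Content Open Platform accounts (Penguin accounts); this content is republished under the Tencent Content Open Platform Service Agreement.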