前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >【数据分析可视化】用python分析了5000部票房,发现赚钱的电影都有这些特征~

【数据分析可视化】用python分析了5000部票房,发现赚钱的电影都有这些特征~

作者头像
瑞新
发布2020-07-07 11:47:42
1.3K0
发布2020-07-07 11:47:42
举报
文章被收录于专栏:用户3288143的专栏

1、采集数据

代码语言:javascript
复制
# 基础
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from datetime import datetime
import json
import warnings
warnings.filterwarnings('ignore')# 忽略python运行过程中的警告

# 可视化
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud,STOPWORDS,ImageColorGenerator # 导入词云包
%matplotlib inline

2、导入数据

下面是moviedf数据集中部分字段的含义介绍:

id:

标识号

imdb id:IMDB标识号

popularity:

在Movie Database上的相对页面查看次数

budget:

预算(美元)

revenue:

收入(美元)

original_title:

电影名称

cast:

演员列表,按|分隔,最多5名演员

homepage:

电影首页的URL

director:

导演列表,按|分隔,最多5名导演

tagline:

电影的标语

keywords:

与电影相关的关键字,按|分隔,最多5个关键字

overview:

剧情摘要

runtime:

电影时长

genres:

风格列表,按|分隔,最多5种风格

production_companies:

制作公司列表,按|分隔,最多5家公司

release_date:

首次上映日期

vote_count:

评分次数

vote_average:

平均评分·release year:

发行年份

代码语言:javascript
复制
# 导入电影数据
credits_url = '/Users/bennyrhys/Desktop/数据分析可视化/homework/tmdb_5000_credits.csv'
movies_url = '/Users/bennyrhys/Desktop/数据分析可视化/homework/tmdb_5000_movies.csv'
credits = pd.read_csv(credits_url)
movies = pd.read_csv(movies_url)
代码语言:javascript
复制
credits.head()

movie_id

title

cast

crew

0

19995

Avatar

[{"cast_id": 242, "character": "Jake Sully", "...

[{"credit_id": "52fe48009251416c750aca23", "de...

1

285

Pirates of the Caribbean: At World's End

[{"cast_id": 4, "character": "Captain Jack Spa...

[{"credit_id": "52fe4232c3a36847f800b579", "de...

2

206647

Spectre

[{"cast_id": 1, "character": "James Bond", "cr...

[{"credit_id": "54805967c3a36829b5002c41", "de...

3

49026

The Dark Knight Rises

[{"cast_id": 2, "character": "Bruce Wayne / Ba...

[{"credit_id": "52fe4781c3a36847f81398c3", "de...

4

49529

John Carter

[{"cast_id": 5, "character": "John Carter", "c...

[{"credit_id": "52fe479ac3a36847f813eaa3", "de...

代码语言:javascript
复制
movies.head()

budget

genres

homepage

id

keywords

original_language

original_title

overview

popularity

production_companies

production_countries

release_date

revenue

runtime

spoken_languages

status

tagline

title

vote_average

vote_count

0

237000000

[{"id": 28, "name": "Action"}, {"id": 12, "nam...

http://www.avatarmovie.com/

19995

[{"id": 1463, "name": "culture clash"}, {"id":...

en

Avatar

In the 22nd century, a paraplegic Marine is di...

150.437577

[{"name": "Ingenious Film Partners", "id": 289...

[{"iso_3166_1": "US", "name": "United States o...

2009-12-10

2787965087

162.0

[{"iso_639_1": "en", "name": "English"}, {"iso...

Released

Enter the World of Pandora.

Avatar

7.2

11800

1

300000000

[{"id": 12, "name": "Adventure"}, {"id": 14, "...

http://disney.go.com/disneypictures/pirates/

285

[{"id": 270, "name": "ocean"}, {"id": 726, "na...

en

Pirates of the Caribbean: At World's End

Captain Barbossa, long believed to be dead, ha...

139.082615

[{"name": "Walt Disney Pictures", "id": 2}, {"...

[{"iso_3166_1": "US", "name": "United States o...

2007-05-19

961000000

169.0

[{"iso_639_1": "en", "name": "English"}]

Released

At the end of the world, the adventure begins.

Pirates of the Caribbean: At World's End

6.9

4500

2

245000000

[{"id": 28, "name": "Action"}, {"id": 12, "nam...

http://www.sonypictures.com/movies/spectre/

206647

[{"id": 470, "name": "spy"}, {"id": 818, "name...

en

Spectre

A cryptic message from Bond’s past sends him o...

107.376788

[{"name": "Columbia Pictures", "id": 5}, {"nam...

[{"iso_3166_1": "GB", "name": "United Kingdom"...

2015-10-26

880674609

148.0

[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...

Released

A Plan No One Escapes

Spectre

6.3

4466

3

250000000

[{"id": 28, "name": "Action"}, {"id": 80, "nam...

http://www.thedarkknightrises.com/

49026

[{"id": 849, "name": "dc comics"}, {"id": 853,...

en

The Dark Knight Rises

Following the death of District Attorney Harve...

112.312950

[{"name": "Legendary Pictures", "id": 923}, {"...

[{"iso_3166_1": "US", "name": "United States o...

2012-07-16

1084939099

165.0

[{"iso_639_1": "en", "name": "English"}]

Released

The Legend Ends

The Dark Knight Rises

7.6

9106

4

260000000

[{"id": 28, "name": "Action"}, {"id": 12, "nam...

http://movies.disney.com/john-carter

49529

[{"id": 818, "name": "based on novel"}, {"id":...

en

John Carter

John Carter is a war-weary, former military ca...

43.926995

[{"name": "Walt Disney Pictures", "id": 2}]

[{"iso_3166_1": "US", "name": "United States o...

2012-03-07

284139100

132.0

[{"iso_639_1": "en", "name": "English"}]

Released

Lost in our world, found in another.

John Carter

6.1

2124

三、数据清洗

1、先将credits数据集和moviedf数据集中的数据合并在一起,再查看合并后的数据集信息:

代码语言:javascript
复制
# 合并数据集
fulldf = pd.concat([credits,movies],axis=1)
fulldf.info()
代码语言:javascript
复制
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 24 columns):
movie_id                4803 non-null int64
title                   4803 non-null object
cast                    4803 non-null object
crew                    4803 non-null object
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(5), object(16)
memory usage: 900.7+ KB

2、选取子集

代码语言:javascript
复制
# 由于数据集中包含的信息过多,其中部分数据并不是我们研究的重点,所以从中选取我们需要的数据:
moviesdf = fulldf[['original_title','crew','release_date','genres','keywords','production_companies',
       'production_countries','revenue','budget','runtime','vote_average']]
moviesdf.info()
代码语言:javascript
复制
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 11 columns):
original_title          4803 non-null object
crew                    4803 non-null object
release_date            4802 non-null object
genres                  4803 non-null object
keywords                4803 non-null object
production_companies    4803 non-null object
production_countries    4803 non-null object
revenue                 4803 non-null int64
budget                  4803 non-null int64
runtime                 4801 non-null float64
vote_average            4803 non-null float64
dtypes: float64(2), int64(2), object(7)
memory usage: 412.9+ KB
代码语言:javascript
复制
# 由于后面的数据分析涉及到电影类型的利润计算,先求出每部电影的利润,并在数据集moviesdf中增加profit数据列:
moviesdf['profit'] = moviesdf['revenue'] - moviesdf['budget']
moviesdf.head()

original_title

crew

release_date

genres

keywords

production_companies

production_countries

revenue

budget

runtime

vote_average

profit

0

Avatar

[{"credit_id": "52fe48009251416c750aca23", "de...

2009-12-10

[{"id": 28, "name": "Action"}, {"id": 12, "nam...

[{"id": 1463, "name": "culture clash"}, {"id":...

[{"name": "Ingenious Film Partners", "id": 289...

[{"iso_3166_1": "US", "name": "United States o...

2787965087

237000000

162.0

7.2

2550965087

1

Pirates of the Caribbean: At World's End

[{"credit_id": "52fe4232c3a36847f800b579", "de...

2007-05-19

[{"id": 12, "name": "Adventure"}, {"id": 14, "...

[{"id": 270, "name": "ocean"}, {"id": 726, "na...

[{"name": "Walt Disney Pictures", "id": 2}, {"...

[{"iso_3166_1": "US", "name": "United States o...

961000000

300000000

169.0

6.9

661000000

2

Spectre

[{"credit_id": "54805967c3a36829b5002c41", "de...

2015-10-26

[{"id": 28, "name": "Action"}, {"id": 12, "nam...

[{"id": 470, "name": "spy"}, {"id": 818, "name...

[{"name": "Columbia Pictures", "id": 5}, {"nam...

[{"iso_3166_1": "GB", "name": "United Kingdom"...

880674609

245000000

148.0

6.3

635674609

3

The Dark Knight Rises

[{"credit_id": "52fe4781c3a36847f81398c3", "de...

2012-07-16

[{"id": 28, "name": "Action"}, {"id": 80, "nam...

[{"id": 849, "name": "dc comics"}, {"id": 853,...

[{"name": "Legendary Pictures", "id": 923}, {"...

[{"iso_3166_1": "US", "name": "United States o...

1084939099

250000000

165.0

7.6

834939099

4

John Carter

[{"credit_id": "52fe479ac3a36847f813eaa3", "de...

2012-03-07

[{"id": 28, "name": "Action"}, {"id": 12, "nam...

[{"id": 818, "name": "based on novel"}, {"id":...

[{"name": "Walt Disney Pictures", "id": 2}]

[{"iso_3166_1": "US", "name": "United States o...

284139100

260000000

132.0

6.1

24139100

3、缺失值处理

通过上面的数据集信息可以知道:整个数据集缺失的数据比较少 其中release_date(首次上映日期)缺失1个数据,runtime(电影时长)缺失2个数据,可以通过网上查询补齐这个数据。

填补release_date(首次上映日期)数据:

代码语言:javascript
复制
# 找出首次上映时间的缺失数据
release_date_null = moviesdf['release_date'].isnull()
moviesdf.loc[release_date_null]

original_title

crew

release_date

genres

keywords

production_companies

production_countries

revenue

budget

runtime

vote_average

profit

4553

America Is Still the Place

[]

NaN

[]

[]

[]

[]

0

0

0.0

0.0

0

代码语言:javascript
复制
# 填充日期
moviesdf['release_date'] = moviesdf['release_date'].fillna('2014-06-01')
# 修改日期格式
moviesdf['release_date'] = pd.to_datetime(moviesdf['release_date'], format='%Y-%m-%d')
moviesdf.info()
代码语言:javascript
复制
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 12 columns):
original_title          4803 non-null object
crew                    4803 non-null object
release_date            4803 non-null datetime64[ns]
genres                  4803 non-null object
keywords                4803 non-null object
production_companies    4803 non-null object
production_countries    4803 non-null object
revenue                 4803 non-null int64
budget                  4803 non-null int64
runtime                 4801 non-null float64
vote_average            4803 non-null float64
profit                  4803 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(3), object(6)
memory usage: 450.4+ KB
代码语言:javascript
复制
# 找出runtime(电影时长)缺失的数据:
runtime_date_null = moviesdf['runtime'].isnull()
moviesdf.loc[runtime_date_null]

original_title

crew

release_date

genres

keywords

production_companies

production_countries

revenue

budget

runtime

vote_average

profit

2656

Chiamatemi Francesco - Il Papa della gente

[{"credit_id": "5660019ac3a36875f100252b", "de...

2015-12-03

[{"id": 18, "name": "Drama"}]

[{"id": 717, "name": "pope"}, {"id": 5565, "na...

[{"name": "Taodue Film", "id": 45724}]

[{"iso_3166_1": "IT", "name": "Italy"}]

0

15000000

NaN

7.3

-15000000

4140

To Be Frank, Sinatra at 100

[{"credit_id": "592b25e4c3a368783e065a2f", "de...

2015-12-12

[{"id": 99, "name": "Documentary"}]

[{"id": 6027, "name": "music"}, {"id": 225822,...

[{"name": "Eyeline Entertainment", "id": 60343}]

[{"iso_3166_1": "GB", "name": "United Kingdom"}]

0

2

NaN

0.0

-2

代码语言:javascript
复制
# 填充runtime缺失值
value1 = {'runtime':98.0}
value2 = {'runtime':81.0}
moviesdf.fillna(value=value1, limit=1, inplace=True)
moviesdf.fillna(value=value2, limit=1, inplace=True)

moviesdf.loc[runtime_date_null]

original_title

crew

release_date

genres

keywords

production_companies

production_countries

revenue

budget

runtime

vote_average

profit

2656

Chiamatemi Francesco - Il Papa della gente

[{"credit_id": "5660019ac3a36875f100252b", "de...

2015-12-03

[{"id": 18, "name": "Drama"}]

[{"id": 717, "name": "pope"}, {"id": 5565, "na...

[{"name": "Taodue Film", "id": 45724}]

[{"iso_3166_1": "IT", "name": "Italy"}]

0

15000000

98.0

7.3

-15000000

4140

To Be Frank, Sinatra at 100

[{"credit_id": "592b25e4c3a368783e065a2f", "de...

2015-12-12

[{"id": 99, "name": "Documentary"}]

[{"id": 6027, "name": "music"}, {"id": 225822,...

[{"name": "Eyeline Entertainment", "id": 60343}]

[{"iso_3166_1": "GB", "name": "United Kingdom"}]

0

2

81.0

0.0

-2

4、数据格式转换

代码语言:javascript
复制
# genres列数据处理:
# genres列格式化
moviesdf['genres'] = moviesdf['genres'].apply(json.loads)
# 自定义函数解码json数据
def decode(column):
    z = []
    for i in column:
        z.append(i['name'])
    return ' '.join(z)

moviesdf['genres'] = moviesdf['genres'].apply(decode)
moviesdf.head(2)

original_title

crew

release_date

genres

keywords

production_companies

production_countries

revenue

budget

runtime

vote_average

profit

0

Avatar

[{"credit_id": "52fe48009251416c750aca23", "de...

2009-12-10

Action Adventure Fantasy Science Fiction

[{"id": 1463, "name": "culture clash"}, {"id":...

[{"name": "Ingenious Film Partners", "id": 289...

[{"iso_3166_1": "US", "name": "United States o...

2787965087

237000000

162.0

7.2

2550965087

1

Pirates of the Caribbean: At World's End

[{"credit_id": "52fe4232c3a36847f800b579", "de...

2007-05-19

Adventure Fantasy Action

[{"id": 270, "name": "ocean"}, {"id": 726, "na...

[{"name": "Walt Disney Pictures", "id": 2}, {"...

[{"iso_3166_1": "US", "name": "United States o...

961000000

300000000

169.0

6.9

661000000

代码语言:javascript
复制
# 建立genres列表,提取电影的类型
genres_list = set()
for i in moviesdf['genres'].str.split(' '):
    genres_list = set().union(i, genres_list)
    genres_list = list(genres_list)
    genres_list
    
genres_list.remove('')
genres_list
代码语言:javascript
复制
['Animation',
 'Foreign',
 'Drama',
 'War',
 'Music',
 'Western',
 'History',
 'Documentary',
 'TV',
 'Action',
 'Family',
 'Romance',
 'Horror',
 'Comedy',
 'Mystery',
 'Thriller',
 'Fantasy',
 'Crime',
 'Movie',
 'Fiction',
 'Adventure',
 'Science']
代码语言:javascript
复制
# release_date列数据处理:
# 保留日期中的年份
moviesdf['release_date'] = pd.to_datetime(moviesdf['release_date']).dt.year
columns = {'release_date':'year'}
moviesdf.rename(columns=columns, inplace=True)
moviesdf['year'].apply(int).head()
代码语言:javascript
复制
0    2009
1    2007
2    2015
3    2012
4    2012
Name: year, dtype: int64

四、数据分析及可视化

问题一:电影类型如何随着时间的推移发生变化的?

代码语言:javascript
复制
# 1、建立包含年份与电影类型数量的关系数据框:

for genre in genres_list:
    moviesdf[genre] = moviesdf['genres'].str.contains(genre).apply(lambda x:1 if x else 0)
代码语言:javascript
复制
genre_year = moviesdf.loc[:,genres_list]
genre_year.head()

Animation

Foreign

Drama

War

Music

Western

History

Documentary

TV

Action

...

Horror

Comedy

Mystery

Thriller

Fantasy

Crime

Movie

Fiction

Adventure

Science

0

0

0

0

0

0

0

0

0

0

1

...

0

0

0

0

1

0

0

1

1

1

1

0

0

0

0

0

0

0

0

0

1

...

0

0

0

0

1

0

0

0

1

0

2

0

0

0

0

0

0

0

0

0

1

...

0

0

0

0

0

1

0

0

1

0

3

0

0

1

0

0

0

0

0

0

1

...

0

0

0

1

0

1

0

0

0

0

4

0

0

0

0

0

0

0

0

0

1

...

0

0

0

0

0

0

0

1

1

1

5 rows × 22 columns

代码语言:javascript
复制
# 把年份作为索引标签
genre_year.index = moviesdf['year']
# 将数据集按年份分组并求和,得出每个年份,各电影类型的总数
genresdf = genre_year.groupby('year').sum()
# 查看数据集.tail默认查看后五行数据
genresdf.tail()

Animation

Foreign

Drama

War

Music

Western

History

Documentary

TV

Action

...

Horror

Comedy

Mystery

Thriller

Fantasy

Crime

Movie

Fiction

Adventure

Science

year

2013

17

0

110

3

12

1

8

10

2

56

...

25

71

5

53

21

37

2

27

36

27

2014

14

0

110

10

9

3

7

7

0

54

...

21

62

15

66

16

27

0

26

37

26

2015

13

0

95

2

8

7

9

7

0

46

...

33

52

20

67

10

26

0

28

35

28

2016

4

0

37

3

1

1

6

0

0

39

...

20

26

6

27

13

10

0

11

23

11

2017

0

0

1

0

0

0

0

0

0

0

...

0

1

0

0

0

0

0

0

0

0

5 rows × 22 columns

代码语言:javascript
复制
# 汇总电影类型的数量
genresdfSum = genresdf.sum(axis=0).sort_values(ascending=False)
genresdfSum
代码语言:javascript
复制
Drama          2297
Comedy         1722
Thriller       1274
Action         1154
Romance         894
Adventure       790
Crime           696
Science         535
Fiction         535
Horror          519
Family          513
Fantasy         424
Mystery         348
Animation       234
History         197
Music           185
War             144
Documentary     110
Western          82
Foreign          34
TV                8
Movie             8
dtype: int64
代码语言:javascript
复制
# 2、数据可视化

# 绘制各种电影类型的数量柱状图

# 设置画板大小
fig=plt.figure(figsize=(12,8))
# 创建画纸,这里只使用1
ax1 = plt.subplot(111)
# 在画纸上绘图
# 电影类型的数量按降序排序
rects = genresdfSum.sort_values(ascending=True).plot(kind='barh', label='genres')
plt.title('1各种电影类型的数量统计图')
plt.xlabel('2电影数量(部)',fontsize=15)
plt.ylabel('3电影类型',fontsize = 15)
plt.show()
代码语言:javascript
复制
# 绘制各种电影类型占比的饼状图:
genres_pie = genresdfSum / genresdfSum.sum()

# 设置other类,当电影类型所占比例小于%1时,全部归到other类中
others = 0.01
genres_pie_otr = genres_pie[genres_pie >= others]
genres_pie_otr['Other'] = genres_pie[genres_pie < others].sum()

# 所占比例小于或等于%2时,对应的饼状图往外长高一截
explode = (genres_pie_otr <= 0.02) / 20 + 0.05

# 设置饼状图的参数
genres_pie_otr.plot(kind='pie', label='',startangle=50,shadow=False,
                   figsize=(10,10),autopct='%1.11f%%', explode=explode)
plt.title('a各种电影类型所占比例')
代码语言:javascript
复制
Text(0.5, 1.0, 'a各种电影类型所占比例')

分析结论:

从上面的结果可以看出,在所有的电影类型中,Drama(戏剧)类型电影最多,占所有电影类型的18.9%,其次为Comedy(喜剧),占所有电影类型的14.2%。

在所有电影类型中,电影数量排名前5的电影类型分别为:

Drama(戏剧)、Comedy(喜剧)、Thriller(惊悚)、Action(动作)、Romance(冒险)。

代码语言:javascript
复制
# 3、电影类型随时间变化的趋势分析:
plt.figure(figsize=(12,8))
plt.plot(genresdf, label=genresdf.columns)
plt.xticks(range(1910,2018,5))
plt.legend(genresdf)
plt.title('1电影类型随时间变化的趋势',fontsize=15)
plt.xlabel('2年份',fontsize=15)
plt.ylabel('3数量(部)',fontsize=15)
plt.grid(True)
plt.show()

分析结论:

从图中观察到,随着时间的推移,所有电影类型都呈现出增长趋势,尤其是1992年以后各个类型的电影均增长迅速,其中Drama(戏剧)和Comedy(喜剧)增长最快,目前仍是最热门的电影类型。

问题二:电影类型与利润的关系?

代码语言:javascript
复制
# 先求出各种电影类型的平均利润:

# 把电影类型作为索引
mean_genre_profit = DataFrame(index=genres_list)

# 求出每种电影类型的平均利润
newarray = []
for genre in genres_list:
    newarray.append(moviesdf.groupby(genre,as_index=True)['profit'].mean())
newarray2 = []
for i in range(len(genres_list)):
    newarray2.append(newarray[i][1])
mean_genre_profit['mean_profit'] = newarray2
mean_genre_profit.head()

mean_profit

Animation

1.592271e+08

Foreign

-2.934369e+05

Drama

3.143791e+07

War

4.887342e+07

Music

3.254800e+07

代码语言:javascript
复制
# 电影类型平均利润数据可视化:

# 数据可视化
plt.figure(figsize=(12,8))

# 对于mean_profig列数据按值大小进行降序排序
mean_genre_profit.sort_values(by='mean_profit',ascending=True).plot(kind='barh')

plt.title('1各种电影类型的平均利润')
plt.xlabel('2平均利润(美元)')
plt.ylabel('3电影类型')
plt.grid(True)
plt.show()
代码语言:javascript
复制
<Figure size 864x576 with 0 Axes>

分析结论:

从图中观察到,拍摄Animation、Adventure、Fantasy这三类电影盈利最好,而拍摄Foreign、TV、Movie这三类电影会存在亏本的风险。

问题三:Universal Pictures和Paramount Pictures两家影视公司发行电影的对比情况如何?

Universal Pictures(环球影业)和Paramount Pictures(派拉蒙影业)是美国两家电影巨头公司。

1、查看 Universal Pictures和Paramount Pictures两家影视公司电影发行的数量

先对production_companies列数据进行处理:

代码语言:javascript
复制
# production_companies列数据格式化
moviesdf['production_companies'] = moviesdf['production_companies'].apply(json.loads)
# 调用自定义函数decode处理production_companies列数据
moviesdf['production_companies'] = moviesdf['production_companies'].apply(decode)
moviesdf.head(2)

original_title

crew

year

genres

keywords

production_companies

production_countries

revenue

budget

runtime

...

Horror

Comedy

Mystery

Thriller

Fantasy

Crime

Movie

Fiction

Adventure

Science

0

Avatar

[{"credit_id": "52fe48009251416c750aca23", "de...

2009

Action Adventure Fantasy Science Fiction

[{"id": 1463, "name": "culture clash"}, {"id":...

Ingenious Film Partners Twentieth Century Fox ...

[{"iso_3166_1": "US", "name": "United States o...

2787965087

237000000

162.0

...

0

0

0

0

1

0

0

1

1

1

1

Pirates of the Caribbean: At World's End

[{"credit_id": "52fe4232c3a36847f800b579", "de...

2007

Adventure Fantasy Action

[{"id": 270, "name": "ocean"}, {"id": 726, "na...

Walt Disney Pictures Jerry Bruckheimer Films S...

[{"iso_3166_1": "US", "name": "United States o...

961000000

300000000

169.0

...

0

0

0

0

1

0

0

0

1

0

2 rows × 34 columns

代码语言:javascript
复制
# 查询production_companies数据列并统计Universal Pictures和Paramount Pictures的数据:

# 查询production_companies数据列中是否含有Universal Pictures,Paramount Pictures有则标记为1
moviesdf['Universal Pictures'] = moviesdf['production_companies'].str.contains('Universal Pictures').apply(lambda x:1 if x else 0)
moviesdf['Paramount Pictures'] = moviesdf['production_companies'].str.contains('Paramount Pictures').apply(lambda x:1 if x else 0)
代码语言:javascript
复制
# 统计Universal Pictures和Paramount Pictures的数据
a = moviesdf['Universal Pictures'].sum()
b = moviesdf['Paramount Pictures'].sum()
dict_company = {'Universal':a, 'Paramount':b}
company_vs = Series(dict_company)
company_vs
代码语言:javascript
复制
Universal    314
Paramount    285
dtype: int64
代码语言:javascript
复制
# 使用饼状图比较两家公司发行的电影占比:
company_vs.plot(kind='pie', label='', startangle=50, shadow=False, autopct='%1.1f%%')
plt.title('Universal Pictures和Paramount Pictures两家公司电影发行数量对比',fontsize=13)
代码语言:javascript
复制
Text(0.5, 1.0, 'Universal Pictures和Paramount Pictures两家公司电影发行数量对比')
代码语言:javascript
复制
# 2、分析Universal Pictures和Paramount Pictures两家影视公司电影发行的走势

# 抽取相关数据列进行处理:
companydf = moviesdf[['Universal Pictures','Paramount Pictures']]
companydf.index = moviesdf['year']

# 对Universal和Paramount公司的发行数量按年分组求和
companydf = companydf.groupby('year').sum()
companydf.tail()

Universal Pictures

Paramount Pictures

year

2013

9

8

2014

10

8

2015

13

7

2016

10

5

2017

0

0

代码语言:javascript
复制
# 两家影视公司电影发行的折线图:
plt.figure(figsize=(12,8))
plt.plot(companydf, label=companydf.columns)
plt.xticks(range(1910,2018,5))
plt.legend(companydf)
plt.title('Universal Pictures和Paramount Pictures公司电影的发行量时间走势', fontsize=15)
plt.xlabel('年份', fontsize = 15)
plt.ylabel('数量(部)',fontsize=15)
plt.grid(True)
plt.show()

分析结论:

从图中观察到,随着时间的推移,Universal Pictures和Paramount Pictures公司的电影发行量呈现出增长趋势,尤其是在1995年后增长迅速,其中Universal Pictures公司比Paramount Pictures公司发行的电影数量更多。

问题四:改编电影和原创电影的对比情况如何?

代码语言:javascript
复制
# 对keywords列数据处理:

# 格式化
moviesdf['keywords'] = moviesdf['keywords'].apply(json.loads)
# 调用自定义的函数decode处理keywords列数据
moviesdf['keywords'] = moviesdf['keywords'].apply(decode)
moviesdf['keywords'].tail()
代码语言:javascript
复制
4798    united states–mexico barrier legs arms paper k...
4799                                                     
4800    date love at first sight narration investigati...
4801                                                     
4802                 obsession camcorder crush dream girl
Name: keywords, dtype: object
代码语言:javascript
复制
# 提取关键字
a = 'based on novel'
moviesdf['if_original'] = moviesdf['keywords'].str.contains(a).apply(lambda x: 'no original' if x else 'original')
moviesdf['if_original'].value_counts()
代码语言:javascript
复制
original       4606
no original     197
Name: if_original, dtype: int64
代码语言:javascript
复制
original_profit = moviesdf[['if_original','budget','revenue','profit']]
original_profit = original_profit.groupby(by='if_original').mean()
original_profit

budget

revenue

profit

if_original

no original

4.532546e+07

1.438100e+08

9.848457e+07

original

2.834872e+07

7.962815e+07

5.127943e+07

代码语言:javascript
复制
# 描绘柱状图,对改编电影与原创电影在预算、收入及利润三方面进行比较:

# 可视化柱状图
plt.figure(figsize=(12,8))
original_profit.plot(kind='bar')
plt.title('改编电影与原创电影在预算、收入和利润的比较')
plt.xlabel('改编电影与原创电影')
plt.ylabel('金额(美元)')
plt.show()
代码语言:javascript
复制
<Figure size 864x576 with 0 Axes>

分析结论:

从图上可以看出,改编电影的预算略高于原创电影,但改编电影的票房收入和利润远远高于原创电影, 这可能是改编电影拥有一定的影迷基础。

问题五:电影时长与电影票房及评分的关系

代码语言:javascript
复制
# 电影时长与电影票房的关系:
moviesdf.plot(kind='scatter', x='runtime', y='revenue', figsize=(8,6))
plt.title('电影时长与电影票房的关系',fontsize = 15)
plt.xlabel('电影时长(分钟)',fontsize=15)
plt.ylabel('电影票房(美元)',fontsize=15)
plt.grid(True)
plt.show()
代码语言:javascript
复制
# 电影时长与电影平均评分的关系:
moviesdf.plot(kind='scatter', x='runtime', y='vote_average', figsize=(8,6))
plt.title('电影时长与电影平均评分的关系',fontsize=15)
plt.xlabel('电影时长(分钟)',fontsize=15)
plt.ylabel('电影平均评分',fontsize=15)
plt.grid(True)
plt.show()

分析结论:

从图上可以看出,电影要想获得较高的票房及良好的口碑,电影的时长应保持在90~150分钟内。

问题六:分析电影关键字

代码语言:javascript
复制
# 先提取电影关键字:

# 利用电影关键字制作词云图
# 建立keywords_list列表
keywords_list = []
for i in moviesdf['keywords']:
    keywords_list.append(i)
    keywords_list = list(keywords_list)
    keywords_list

    
# 把字符串列表链接成一个长字符串
lis = ''.join(keywords_list)
# 使用空格替换中间多余的字符串'\'s'
lis.replace('\'s', '')

'culture clash future space war space colony society space travel futuristic romance space alien tribe alien planet cgi marine soldier battle love affair anti war power relations mind and soul 3docean drug abuse exotic island east india trading company love of one life traitor shipwreck strong woman ship alliance calypso afterlife fighter pirate

代码语言:javascript
复制
# 通过词云包WordCloud生成词云图:
wc = WordCloud(background_color='black',
              max_words=2000,
              max_font_size=100,
              random_state=12)

# 根据字符串生成词云
wc.generate(lis)

plt.figure(figsize=(16,8))
# 显示图片
plt.imshow(wc)
plt.axis('off')
plt.show()

分析结论:

通过对电影关键字的分析,电影中经常被提及的词语是女性(woman)、独立(independent),其次是谋杀(murder)、爱情(love)、警察(police)、暴力(violence),可见观众对女性和独立方面题材的电影最感兴趣,其次是是犯罪类和爱情类电影。

本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
原始发表:2020/05/14 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 1、采集数据
  • 2、导入数据
  • 三、数据清洗
    • 1、先将credits数据集和moviedf数据集中的数据合并在一起,再查看合并后的数据集信息:
      • 2、选取子集
        • 3、缺失值处理
          • 4、数据格式转换
          • 四、数据分析及可视化
            • 问题一:电影类型如何随着时间的推移发生变化的?
              • 问题二:电影类型与利润的关系?
                • 问题三:Universal Pictures和Paramount Pictures两家影视公司发行电影的对比情况如何?
                  • 问题四:改编电影和原创电影的对比情况如何?
                    • 问题五:电影时长与电影票房及评分的关系
                      • 问题六:分析电影关键字
                      领券
                      问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档