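Today we'll look at how to get data from the popular movie site Rotten Tomatoes. You'll need to sign up for an API key first. When you get your key, make a note of your usage limits (for example, the number of calls allowed per minute). Don't call the API more often than your limit allows, since that may get your key revoked. Finally, it's always a good idea to read the documentation of the API you're going to use. Here are a couple of links:
You can save those docs to read later; for now, let's move on.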
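Since going over the limit can cost you your key, it may help to space your calls out. Here is a minimal sketch of one way to do that, written for Python 3. The Throttle class and the figure of 10 calls per minute are my own assumptions for illustration, not something taken from the Rotten Tomatoes documentation, so plug in whatever limit your key actually has.

```python
import time


class Throttle:
    """Space out calls so at most `max_calls` happen per `period` seconds."""

    def __init__(self, max_calls, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds that must separate two calls
        self.min_interval = period / float(max_calls)
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


# Hypothetical usage: call throttle.wait() right before each API request.
# throttle = Throttle(max_calls=10, period=60.0)
# throttle.wait()
```

The clock and sleep arguments only exist so the class can be exercised without really waiting; in normal use you would leave them at their defaults.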
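The Rotten Tomatoes API provides a set of JSON feeds that we can extract data from. We'll use requests and simplejson to fetch the data and process it. Let's write a small script that gets the movies currently playing in theaters.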
import requests
import simplejson

#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters.
    """
    key = "YOUR API KEY"
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
    res = requests.get(url % key)

    data = res.content

    js = simplejson.loads(data)

    movies = js["movies"]
    for movie in movies:
        # Python 2 print statement; on Python 3 use print(movie["title"])
        print movie["title"]

#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()
If you run this code, you'll see a list of movies printed out. I got the following output:
Free Birds
Gravity
Ender's Game
Jackass Presents: Bad Grandpa
Last Vegas
The Counselor
Cloudy with a Chance of Meatballs 2
Captain Phillips
Carrie
Escape Plan
Enough Said
Insidious: Chapter 2
12 Years a Slave
We're The Millers
Prisoners
Baggage Claim
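In the code above, we build a URL with our API key and fetch the data. Then we load the response with simplejson, which gives us a nested Python dictionary. Next we loop over the list of movie dictionaries and print each movie's title. Now we're ready to create a new function that pulls additional information about each of these movies from Rotten Tomatoes.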
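If you'd like to try the parsing step without an API key, here is a rough Python 3 sketch that runs against a canned response. The SAMPLE_RESPONSE string is a hypothetical, heavily trimmed stand-in that I made up for illustration (a real response carries many more fields), and I'm using the standard library's json module in place of simplejson.

```python
import json

# Hypothetical, heavily trimmed response body (an assumption for illustration)
SAMPLE_RESPONSE = """
{
  "movies": [
    {"title": "Gravity"},
    {"title": "Ender's Game"}
  ]
}
"""


def extract_titles(raw_json):
    """Parse a JSON response body and return the movie titles in order."""
    data = json.loads(raw_json)       # nested dictionaries and lists
    return [movie["title"] for movie in data["movies"]]


if __name__ == "__main__":
    for title in extract_titles(SAMPLE_RESPONSE):
        print(title)
```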
import requests
import simplejson
import urllib

#----------------------------------------------------------------------
def getMovieDetails(key, title):
    """
    Get additional movie details
    """
    if " " in title:
        parts = title.split(" ")
        title = "+".join(parts)

    link = "http://api.rottentomatoes.com/api/public/v1.0/movies.json"
    url = "%s?apikey=%s&q=%s&page_limit=1"
    url = url % (link, key, title)
    res = requests.get(url)
    js = simplejson.loads(res.content)

    for movie in js["movies"]:
        print "rated: %s" % movie["mpaa_rating"]
        print "movie synopsis: " + movie["synopsis"]
        print "critics_consensus: " + movie["critics_consensus"]

        print "Major cast:"
        for actor in movie["abridged_cast"]:
            print "%s as %s" % (actor["name"], actor["characters"][0])

        ratings = movie["ratings"]
        print "runtime: %s" % movie["runtime"]
        print "critics score: %s" % ratings["critics_score"]
        print "audience score: %s" % ratings["audience_score"]
        print "for more information: %s" % movie["links"]["alternate"]
    print "-" * 40
    print

#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters.
    """
    key = "YOUR API KEY"
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
    res = requests.get(url % key)

    data = res.content

    js = simplejson.loads(data)

    movies = js["movies"]
    for movie in movies:
        print movie["title"]
        getMovieDetails(key, movie["title"])
        print

#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()
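One thing to watch in getMovieDetails: it only swaps spaces for "+", while titles such as "Ender's Game" or "Insidious: Chapter 2" contain other characters that are normally percent-encoded in a query string. Here is a small Python 3 sketch of a different approach that lets the standard library do the encoding. The build_search_url helper is a hypothetical name of mine; the endpoint and the parameter names are the ones used in the listing above.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://api.rottentomatoes.com/api/public/v1.0/movies.json"


def build_search_url(key, title):
    """Return the search URL with every query parameter percent-encoded."""
    params = [("apikey", key), ("q", title), ("page_limit", 1)]
    # urlencode turns spaces into "+" and escapes characters such as ' and :
    return "%s?%s" % (SEARCH_URL, urlencode(params))


if __name__ == "__main__":
    print(build_search_url("YOUR API KEY", "Insidious: Chapter 2"))
```

If you are already using requests, passing a dictionary through the params argument of requests.get does the same encoding for you.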
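This new code pulls out a lot of data about each movie, but the JSON that comes back contains quite a bit more that we aren't showing. Just print the js dictionary to stdout to see what else is there, or look at the sample JSON response on the Rotten Tomatoes documentation page. If you look closely, you'll find that the Rotten Tomatoes API doesn't cover all the data on their website. For example, there's no way to look up information about an actor: if we wanted to know every movie Jim Carrey has been in, there's no public endpoint to query. You also can't look up other people in the credits, such as the director or producer. That information is on the website, but the API doesn't expose it. For that we would have to turn to the Internet Movie Database (IMDB), which we won't go into here.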
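Printing a big nested dictionary in one lump is hard to read, so here is a tiny Python 3 sketch that indents it first. The pretty helper is a hypothetical name, and the js value at the bottom is a made-up stand-in for the parsed response.

```python
import json


def pretty(js):
    """Return a dictionary as indented JSON text with its keys sorted."""
    return json.dumps(js, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Made-up stand-in for the dictionary returned by the JSON parser
    js = {"movies": [{"title": "Gravity"}]}
    print(pretty(js))
```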
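Let's take a moment to improve this example. One simple improvement is to put the API key into a config file (so it isn't sitting where anyone can see it at a glance). Another is to actually store the information we download. A third is to add some code that checks whether we've already downloaded today's movies, because there's really no reason to download all the data more than once a day!
I prefer and recommend ConfigObj for working with config files. Let's create a simple "config.ini" file with the following contents: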
[Settings]
api_key = API KEY
last_downloaded =
Now let's change our code to import ConfigObj and update the getInTheaterMovies function to use it:
import requests
import simplejson
import urllib

from configobj import ConfigObj

#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters.
    """
    config = ConfigObj("config.ini")
    key = config["Settings"]["api_key"]
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
    res = requests.get(url % key)

    data = res.content

    js = simplejson.loads(data)

    movies = js["movies"]
    for movie in movies:
        print movie["title"]
        getMovieDetails(key, movie["title"])
        print

#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()
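As you can see, we import configobj and pass it the name of our config file. You could also pass an absolute path. Next we pull out the api_key value and use it in our URL. Since our config also has a last_downloaded value, we should add that to our code so that we don't download the same data more than once a day.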
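If you don't have ConfigObj installed, a file laid out like this one can also be read with Python's built-in configparser module. The sketch below is Python 3 and swaps in configparser for ConfigObj, so treat it as an alternative rather than what the article's code does. The read_api_key helper is a hypothetical name, and the example writes a throwaway config file so that it runs on its own.

```python
import configparser
import os
import tempfile


def read_api_key(path):
    """Return the api_key value from the [Settings] section of an INI file."""
    parser = configparser.ConfigParser()
    # read() returns the list of files it managed to parse
    if not parser.read(path):
        raise IOError("Could not read config file: %s" % path)
    return parser["Settings"]["api_key"]


if __name__ == "__main__":
    # Write a temporary config file with the same layout as config.ini
    with tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False) as tmp:
        tmp.write("[Settings]\napi_key = API KEY\nlast_downloaded =\n")
    try:
        print(read_api_key(tmp.name))
    finally:
        os.remove(tmp.name)
```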
import datetime
import requests
import simplejson
import urllib

from configobj import ConfigObj

#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters.
    """
    today = datetime.datetime.today().strftime("%Y%m%d")
    config = ConfigObj("config.ini")

    # only download if we haven't already done so today
    if today != config["Settings"]["last_downloaded"]:
        config["Settings"]["last_downloaded"] = today

        try:
            with open("config.ini", "w") as cfg:
                config.write(cfg)
        except IOError:
            print "Error writing file!"
            return

        key = config["Settings"]["api_key"]
        url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
        res = requests.get(url % key)

        data = res.content

        js = simplejson.loads(data)

        movies = js["movies"]
        for movie in movies:
            print movie["title"]
            getMovieDetails(key, movie["title"])
            print

#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()
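Here we import Python's datetime module and use it to get today's date in the format YYYYMMDD. Next we check whether the config file's last_downloaded value equals today's date. If it does, we do nothing. If they don't match, we set last_downloaded to today's date and then download the movie data. Now we're ready to learn how to save the data to a database.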
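The date check is easier to poke at when it's pulled out into its own function. Here is a minimal Python 3 sketch of that idea; needs_download is a hypothetical helper of mine, and the optional now argument is only there so that you can feed it a fixed date.

```python
import datetime


def needs_download(last_downloaded, now=None):
    """Return (should_download, today_stamp) for a stored YYYYMMDD value."""
    if now is None:
        now = datetime.datetime.today()
    today = now.strftime("%Y%m%d")
    # An empty or older stamp means we have not downloaded yet today
    return today != last_downloaded, today


if __name__ == "__main__":
    print(needs_download(""))
```

One design point worth considering: the listing above writes last_downloaded before it fetches anything, so a failed request would still mark the day as done. Saving the stamp after a successful download is one way around that.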
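Python has supported SQLite natively since version 2.5, so unless you're using a really old version of Python, you should be able to follow this part without any trouble. Basically, we just need to add a function that creates the database and saves the data into it. Here it is: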
import os
import sqlite3

#----------------------------------------------------------------------
def saveData(movie):
    """
    Save the data to a SQLite database
    """
    if not os.path.exists("movies.db"):
        # create the database
        conn = sqlite3.connect("movies.db")

        cursor = conn.cursor()

        cursor.execute("""CREATE TABLE movies
        (title text, rated text, movie_synopsis text,
        critics_consensus text, runtime integer,
        critics_score integer, audience_score integer)""")

        cursor.execute("""
        CREATE TABLE cast
        (actor text,
        character text)
        """)

        # link table; SQLite does not enforce these keys by default
        cursor.execute("""
        CREATE TABLE movie_cast
        (movie_id integer,
        cast_id integer,
        FOREIGN KEY(movie_id) REFERENCES movies(id),
        FOREIGN KEY(cast_id) REFERENCES cast(id)
        )
        """)
    else:
        conn = sqlite3.connect("movies.db")
        cursor = conn.cursor()

    # insert the data
    print
    sql = "INSERT INTO movies VALUES(?, ?, ?, ?, ?, ?, ?)"
    cursor.execute(sql, (movie["title"],
                         movie["mpaa_rating"],
                         movie["synopsis"],
                         movie["critics_consensus"],
                         movie["runtime"],
                         movie["ratings"]["critics_score"],
                         movie["ratings"]["audience_score"]
                         )
                   )
    movie_id = cursor.lastrowid

    for actor in movie["abridged_cast"]:
        print "%s as %s" % (actor["name"], actor["characters"][0])
        sql = "INSERT INTO cast VALUES(?, ?)"
        cursor.execute(sql, (actor["name"],
                             actor["characters"][0]
                             )
                       )
        cast_id = cursor.lastrowid

        sql = "INSERT INTO movie_cast VALUES(?, ?)"
        cursor.execute(sql, (movie_id, cast_id))

    conn.commit()
    conn.close()
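This code first checks whether the database file already exists. If it doesn't, it creates the database along with 3 tables. Otherwise, it simply opens a connection and creates a cursor object. Next it inserts the data from the movie dictionary it was passed; we'll call this function from getMovieDetails and hand it the movie dictionary. Finally, it commits the data to the database and closes the connection.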
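To see how the three tables hang together, here is a cut-down Python 3 sketch that runs against an in-memory database and reads the cast back with a join. It is not the article's schema: I've added explicit integer id columns, kept only the title for each movie, and renamed the cast table to cast_members because CAST is also an SQL keyword. All of the function names and the sample movie are hypothetical.

```python
import sqlite3


def create_schema(conn):
    """Create three small tables with explicit integer ids (an assumption)."""
    conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE cast_members "
                 "(id INTEGER PRIMARY KEY, actor TEXT, character TEXT)")
    conn.execute("""
        CREATE TABLE movie_cast (
            movie_id INTEGER REFERENCES movies(id),
            cast_id INTEGER REFERENCES cast_members(id)
        )""")


def save_movie(conn, movie):
    """Insert one movie and its cast, linking them through movie_cast."""
    cursor = conn.cursor()
    cursor.execute("INSERT INTO movies (title) VALUES (?)", (movie["title"],))
    movie_id = cursor.lastrowid
    for actor in movie["abridged_cast"]:
        cursor.execute(
            "INSERT INTO cast_members (actor, character) VALUES (?, ?)",
            (actor["name"], actor["characters"][0]))
        cursor.execute("INSERT INTO movie_cast VALUES (?, ?)",
                       (movie_id, cursor.lastrowid))
    conn.commit()
    return movie_id


def cast_of(conn, title):
    """Return (actor, character) pairs for a title, in insertion order."""
    sql = """
        SELECT c.actor, c.character
        FROM movies AS m
        JOIN movie_cast AS mc ON mc.movie_id = m.id
        JOIN cast_members AS c ON c.id = mc.cast_id
        WHERE m.title = ?
        ORDER BY c.id"""
    return conn.execute(sql, (title,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # Made-up movie dictionary shaped like the ones used in this article
    save_movie(conn, {"title": "Example Movie", "abridged_cast": [
        {"name": "Actor One", "characters": ["Role One"]}]})
    print(cast_of(conn, "Example Movie"))
    conn.close()
```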
You're probably wondering what the complete code looks like. Here it is:
import datetime
import os
import requests
import simplejson
import sqlite3
import urllib

from configobj import ConfigObj

#----------------------------------------------------------------------
def getMovieDetails(key, title):
    """
    Get additional movie details
    """
    if " " in title:
        parts = title.split(" ")
        title = "+".join(parts)

    link = "http://api.rottentomatoes.com/api/public/v1.0/movies.json"
    url = "%s?apikey=%s&q=%s&page_limit=1"
    url = url % (link, key, title)
    res = requests.get(url)
    js = simplejson.loads(res.content)

    for movie in js["movies"]:
        print "rated: %s" % movie["mpaa_rating"]
        print "movie synopsis: " + movie["synopsis"]
        print "critics_consensus: " + movie["critics_consensus"]

        print "Major cast:"
        for actor in movie["abridged_cast"]:
            print "%s as %s" % (actor["name"], actor["characters"][0])

        ratings = movie["ratings"]
        print "runtime: %s" % movie["runtime"]
        print "critics score: %s" % ratings["critics_score"]
        print "audience score: %s" % ratings["audience_score"]
        print "for more information: %s" % movie["links"]["alternate"]
        saveData(movie)
    print "-" * 40
    print

#----------------------------------------------------------------------
def getInTheaterMovies():
    """
    Get a list of movies in theaters.
    """
    today = datetime.datetime.today().strftime("%Y%m%d")
    config = ConfigObj("config.ini")

    # only download if we haven't already done so today
    if today != config["Settings"]["last_downloaded"]:
        config["Settings"]["last_downloaded"] = today

        try:
            with open("config.ini", "w") as cfg:
                config.write(cfg)
        except IOError:
            print "Error writing file!"
            return

        key = config["Settings"]["api_key"]
        url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=%s"
        res = requests.get(url % key)

        data = res.content

        js = simplejson.loads(data)

        movies = js["movies"]
        for movie in movies:
            print movie["title"]
            getMovieDetails(key, movie["title"])
            print

#----------------------------------------------------------------------
def saveData(movie):
    """
    Save the data to a SQLite database
    """
    if not os.path.exists("movies.db"):
        # create the database
        conn = sqlite3.connect("movies.db")

        cursor = conn.cursor()

        cursor.execute("""CREATE TABLE movies
        (title text, rated text, movie_synopsis text,
        critics_consensus text, runtime integer,
        critics_score integer, audience_score integer)""")

        cursor.execute("""
        CREATE TABLE cast
        (actor text,
        character text)
        """)

        # link table; SQLite does not enforce these keys by default
        cursor.execute("""
        CREATE TABLE movie_cast
        (movie_id integer,
        cast_id integer,
        FOREIGN KEY(movie_id) REFERENCES movies(id),
        FOREIGN KEY(cast_id) REFERENCES cast(id)
        )
        """)
    else:
        conn = sqlite3.connect("movies.db")
        cursor = conn.cursor()

    # insert the data
    print
    sql = "INSERT INTO movies VALUES(?, ?, ?, ?, ?, ?, ?)"
    cursor.execute(sql, (movie["title"],
                         movie["mpaa_rating"],
                         movie["synopsis"],
                         movie["critics_consensus"],
                         movie["runtime"],
                         movie["ratings"]["critics_score"],
                         movie["ratings"]["audience_score"]
                         )
                   )
    movie_id = cursor.lastrowid

    for actor in movie["abridged_cast"]:
        print "%s as %s" % (actor["name"], actor["characters"][0])
        sql = "INSERT INTO cast VALUES(?, ?)"
        cursor.execute(sql, (actor["name"],
                             actor["characters"][0]
                             )
                       )
        cast_id = cursor.lastrowid

        sql = "INSERT INTO movie_cast VALUES(?, ?)"
        cursor.execute(sql, (movie_id, cast_id))

    conn.commit()
    conn.close()

#----------------------------------------------------------------------
if __name__ == "__main__":
    getInTheaterMovies()
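If you use the Firefox browser, you can use a plugin called SQLite Manager to visualize the database we created. Here is a screenshot of what was produced while writing this article:
There are still plenty of features that could be added. For example, getInTheaterMovies needs some code that loads the details from the database when we've already downloaded the current data. We also need to add some logic to the database code to keep us from adding the same actor or movie more than once. It would also be nice to have some kind of GUI or web interface. These are all fun little exercises you can add yourself.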
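As a starting point for the duplicate problem, here is one possible Python 3 sketch that leans on a UNIQUE constraint together with SQLite's INSERT OR IGNORE. The actors table and both helper functions are hypothetical and simpler than the tables used earlier, so you would need to adapt the idea to your own schema.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database whose actors table refuses duplicate names."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS actors "
                 "(id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    return conn


def get_or_create_actor(conn, name):
    """Return the id for an actor, inserting a row only if the name is new."""
    # OR IGNORE silently skips the insert when the UNIQUE constraint would fail
    conn.execute("INSERT OR IGNORE INTO actors (name) VALUES (?)", (name,))
    conn.commit()
    row = conn.execute("SELECT id FROM actors WHERE name = ?",
                       (name,)).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = open_db()
    first = get_or_create_actor(conn, "Jim Carrey")
    second = get_or_create_actor(conn, "Jim Carrey")
    print(first == second)
    conn.close()
```

The same pattern would work for movies if you pick a column, or a combination of columns, that identifies a movie uniquely.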
By the way, this article was inspired by Michael Herman's book Real Python for the Web. It's full of neat ideas and examples. You can check it out here.