Learn Web Scraping Step by Step (Beginner Series), Lesson 3: The Beautiful Soup 4 Library

GitOPEN

Published 2019-01-29 10:18:04

Part of the column: Tech Digest from GitOPEN

Download link for all source code in this tutorial: https://share.weiyun.com/5xmFeUO Password: fzwh6g

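The Beautiful Soup 4 Library

Introduction and Installation

Beautiful Soup 4 is a Python library for extracting data from HTML and XML files. It greatly simplifies pulling data out of a web page's source.

An HTML or XML document is a tag tree. Once parsed with bs4, that tag tree is represented by a BeautifulSoup object.

Installing Beautiful Soup 4: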

pip install beautifulsoup4

Basic Usage of Beautiful Soup 4

First Look

Let's try it out in an IPython session:

In [1]: import requests
In [2]: r = requests.get("https://raw.githubusercontent.com/opengit/CrawlerLessons/master/codes/lesson03/bs4_demo01.html")
In [3]: r.text
Out[3]: '<!DOCTYPE html>\n<html>\n\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n    <title>VPS推荐</title>\n</head>\n\n<body>\n<p class="title">\n    <b>亲测速度很快</b>\n</p>\n<p class="links">\n    下面是两个推荐的VPS服务器链接：\n    <a href="https://m.do.co/c/fd128f8ba9e8" class="vps1" id="link1">Digital Ocean优惠链接</a> 和\n    <a href="https://www.vultr.com/?ref=7147564" class="vps2" id="link2">Vultr优惠10美元链接</a>。\n</p>\n</body>\n</html>'
In [4]: demo = r.text
In [5]: from bs4 import BeautifulSoup
In [6]: soup = BeautifulSoup(demo, "html.parser")
In [7]: print(soup.prettify())
<!DOCTYPE html>
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   VPS推荐
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    亲测速度很快
   </b>
  </p>
  <p class="links">
   下面是两个推荐的VPS服务器链接：
   <a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">
    Digital Ocean优惠链接
   </a>
   和
   <a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">
    Vultr优惠10美元链接
   </a>
   。
  </p>
 </body>
</html>

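To summarize, the simplest way to use Beautiful Soup 4 is:

# Import Beautiful Soup 4
from bs4 import BeautifulSoup
# The first argument is the HTML document content, the second is the parser
soup = BeautifulSoup(demo, "html.parser")

Here html.parser is the HTML parser; parsers are covered later in this lesson.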
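
The same two steps can be tried without any network access. The following is a minimal sketch that assumes a trimmed, inlined copy of the demo page (the variable names `html`, `title_text`, and `first_bold` are illustrative, not from the original session):

```python
from bs4 import BeautifulSoup

# A trimmed copy of the demo page, inlined so the example runs offline.
html = (
    "<html><head><title>VPS推荐</title></head>"
    '<body><p class="title"><b>亲测速度很快</b></p></body></html>'
)

# Step 1: hand the markup and the parser name to BeautifulSoup.
soup = BeautifulSoup(html, "html.parser")

# Step 2: navigate the tag tree by tag name.
title_text = soup.title.string
first_bold = soup.b.string

print(title_text)
print(first_bold)
```

Passing the parser name explicitly keeps the result independent of which optional parsers happen to be installed.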

Elements of Beautiful Soup 4

Basic Elements of the BeautifulSoup Class

<p class="title">
    <b>亲测速度很快</b>
</p>

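<p>...</p> is a tag (Tag)
p is the tag's name (Name)
class="title" is an attribute
class is the attribute's name
title is the attribute's value

Element	Meaning
Tag	A tag, the most basic unit; its start and end are marked with <> and </>
Name	The tag's name; the name of <a></a> is a. Usage: <tag>.name
Attributes	The tag's attributes, as a dictionary. Usage: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text in <>...</>. Usage: <tag>.string
Comment	The comment part of the string inside a tag

Using these basic elements in an IPython session: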

# Import Beautiful Soup 4
In [1]: from bs4 import BeautifulSoup
# Import requests
In [2]: import requests
# Fetch the page with a GET request
In [3]: r = requests.get("https://raw.githubusercontent.com/opengit/CrawlerLessons/master/codes/lesson03/bs4_demo01.html")

# The HTML text
In [4]: demo = r.text
# Create the BeautifulSoup object
In [5]: soup = BeautifulSoup(demo, "html.parser")
# Name of the first <a> tag
In [6]: soup.a.name
Out[6]: 'a'
# Name of the <a> tag's parent
In [7]: soup.a.parent.name
Out[7]: 'p'
# Name of the parent's parent
In [8]: soup.a.parent.parent.name
Out[8]: 'body'
# soup.a is a Tag
In [9]: tag = soup.a
# Attributes of the <a> Tag
In [10]: tag.attrs
Out[10]: {'href': 'https://m.do.co/c/fd128f8ba9e8', 'class': ['vps1'], 'id': 'link1'}
# Read individual attribute values
In [11]: tag.attrs['class']
Out[11]: ['vps1']
In [12]: tag.attrs['href']
Out[12]: 'https://m.do.co/c/fd128f8ba9e8'
# A Tag's attributes are returned as a dict
In [13]: type(tag.attrs)
Out[13]: dict
# soup.a is of type Tag
In [14]: type(tag)
Out[14]: bs4.element.Tag
In [15]: tag
Out[15]: <a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a>
# The string inside the tag
In [16]: tag.string
Out[16]: 'Digital Ocean优惠链接'
In [17]: soup.p
Out[17]:
<p class="title">
<b>亲测速度很快</b>
</p>
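# This <p> has several children (whitespace text nodes and <b>), so .string is None and IPython prints nothing
In [18]: soup.p.string
In [19]: soup.p.b.string
Out[19]: '亲测速度很快'
# The string inside a tag is a NavigableString
In [20]: type(soup.p.b.string)
Out[20]: bs4.element.NavigableString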

# Comment example: both strings print the same, so check the type to tell them apart
In [25]: newsoup = BeautifulSoup("<b><!--我是注释--></b><p>我是注释</p>","html.parser")
In [26]: newsoup.b.string
Out[26]: '我是注释'
In [27]: type(newsoup.b.string)
Out[27]: bs4.element.Comment
In [29]: newsoup.p.string
Out[29]: '我是注释'
In [30]: type(newsoup.p.string)
Out[30]: bs4.element.NavigableString
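
Because a Comment prints exactly like ordinary text, code that collects visible text usually has to check the type. The following is a small sketch of one way to do that; the helper name `visible_text` is hypothetical and not part of bs4:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup("<b><!--我是注释--></b><p>我是注释</p>", "html.parser")


def visible_text(tag):
    """Return tag.string as a plain str, or None if it is missing or a comment."""
    s = tag.string
    # Comment is a subclass of NavigableString, so an isinstance check separates them.
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


b_text = visible_text(soup.b)  # the <b> tag only holds a comment
p_text = visible_text(soup.p)  # the <p> tag holds ordinary text

print(b_text)
print(p_text)
```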

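Parsers for Beautiful Soup 4

Parser	Usage	Advantages	Disadvantages	Requirement
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in Python versions before 2.7.3 or 3.2.2	None (works out of the box)
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires a C library	pip install lxml
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires a C library	pip install lxml
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses documents the way a browser does; produces valid HTML5	Slow (pure Python, so no C extension is needed)	pip install html5lib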
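
Since lxml is optional, a script may want to prefer it when it is installed and fall back to the built-in parser otherwise. The sketch below shows one possible way to do that; the helper `make_soup` is a hypothetical name, and it relies on bs4 raising `FeatureNotFound` when the requested parser is unavailable:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser if it is missing."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library parser.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
text = soup.p.string
print(text)
```

Different parsers can build slightly different trees from malformed markup, so a fallback like this is best suited to well-formed pages.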

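Traversing HTML with Beautiful Soup 4

Basic HTML Structure

HTML is a tree of tags. The HTML used in the examples above has this structure: html contains head and body; head contains meta and title; body contains two p tags, the first holding a b tag and the second holding two a tags.

Downward Traversal

Downward traversal walks from a parent node to its child nodes. Beautiful Soup 4 provides the following attributes for it:

Attribute	Meaning
.contents	A list holding all of <tag>'s child nodes
.children	An iterator over all child nodes, for use in loops
.descendants	An iterator over all descendant nodes, for use in loops

Let's test these in IPython: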

In [33]: r = requests.get("https://raw.githubusercontent.com/opengit/CrawlerLessons/master/codes/lesson03/bs4_demo01.html")
In [34]: demo = r.text
In [35]: soup = BeautifulSoup(demo, "html.parser")
In [36]: soup.head
Out[36]: <head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>VPS推荐</title></head>

# List of the head tag's children
In [37]: soup.head.contents
Out[37]:
[<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>,
 <title>VPS推荐</title>]
# List of the body tag's children
In [38]: body = soup.body.contents
In [39]: body
Out[39]:
[<p class="title"><b>亲测速度很快</b></p>,
 <p class="links">下面是两个推荐的VPS服务器链接：<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a> 和<a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>。</p>]

# Number of children of the body tag
In [40]: len(body)
Out[40]: 2

# A generator over all descendants of body
In [41]: bodys = soup.body.descendants
In [42]: bodys
Out[42]: <generator object descendants at 0x104e2c0a0>
# Iterate over the descendants
In [43]: for item in bodys:
    ...:     print(item)
    ...:
<p class="title"><b>亲测速度很快</b></p>
<b>亲测速度很快</b>
亲测速度很快
<p class="links">下面是两个推荐的VPS服务器链接：<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a> 和<a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>。</p>
下面是两个推荐的VPS服务器链接：
<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a>
Digital Ocean优惠链接
 和
<a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>
Vultr优惠10美元链接
。
# An iterator over the direct children of body
In [44]: bodyc = soup.body.children
# Iterate over the children
In [45]: for item in bodyc:
    ...:     print(item)
    ...:
<p class="title"><b>亲测速度很快</b></p>
<p class="links">下面是两个推荐的VPS服务器链接：<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a> 和<a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>。</p>
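
When the markup contains line breaks between tags, the whitespace shows up as text nodes among the children, so counts can differ from the session above. The following sketch, which assumes a small inlined fragment with such line breaks, filters the children down to tags only:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inlined fragment with newlines between tags (an assumption for illustration).
html = """<body>
<p class="title"><b>亲测速度很快</b></p>
<p class="links">links</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# .children yields every direct child, including the "\n" text nodes.
all_children = list(soup.body.children)

# Keep only Tag objects to skip text nodes.
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(all_children), [t.name for t in tag_children], descendant_tags)
```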

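Upward Traversal

Upward traversal walks the tag tree from the bottom up. The attributes for it are:

Attribute	Meaning
.parent	The parent of <tag>
.parents	An iterator over all ancestors of <tag>, for use in loops

Continuing the example above in IPython: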

In [46]: soup.title.parent
Out[46]: <head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>VPS推荐</title></head>
# <html> is the outermost tag, so its parent is the BeautifulSoup object, i.e. the whole document
In [47]: soup.html.parent
Out[47]:
<!DOCTYPE html>

<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>VPS推荐</title></head><body><p class="title"><b>亲测速度很快</b></p><p class="links">下面是两个推荐的VPS服务器链接：<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a> 和<a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>。</p></body></html>
# The BeautifulSoup object itself has no parent (None), so nothing is printed
In [48]: soup.parent
# Iterate over the ancestors of <html>
In [49]: for item in soup.html.parents:
    ...:     print(item)
    ...:
<!DOCTYPE html>

<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>VPS推荐</title></head><body><p class="title"><b>亲测速度很快</b></p><p class="links">下面是两个推荐的VPS服务器链接：<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a> 和<a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>。</p></body></html>
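
Printing whole ancestors is noisy; printing only their names makes the chain easier to read. A short sketch, assuming a reduced inline version of the demo page:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="links"><a id="link1">x</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Walk from the <a> tag up to the root; the BeautifulSoup object is named "[document]".
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)

# The root object has no parent.
root_parent = soup.parent
print(root_parent)
```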

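Sibling Traversal

Prerequisite: nodes that take part in a sibling traversal must share the same parent. Note that a sibling may be a text node rather than a tag.

Attribute	Meaning
.next_sibling	The next sibling node in document order
.previous_sibling	The previous sibling node in document order
.next_siblings	An iterator over all following sibling nodes in document order
.previous_siblings	An iterator over all preceding sibling nodes in document order

Using these attributes in IPython: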

In [50]: soup.body.contents
Out[50]:
[<p class="title"><b>亲测速度很快</b></p>,
 <p class="links">下面是两个推荐的VPS服务器链接：<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a> 和<a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>。</p>]

In [51]: soup.body.contents[0].next_sibling
Out[51]: <p class="links">下面是两个推荐的VPS服务器链接：<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a> 和<a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>。</p>

In [52]: soup.body.contents[1].next_sibling

In [53]: soup.body.contents[1].previous_sibling
Out[53]: <p class="title"><b>亲测速度很快</b></p>

In [54]: soup.a.next_sibling
Out[54]: ' 和'

In [55]: soup.a.next_sibling.next_sibling
Out[55]: <a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>

In [56]: soup.a.previousSibling
Out[56]: '下面是两个推荐的VPS服务器链接：'

In [57]: soup.a.previous_sibling
Out[57]: '下面是两个推荐的VPS服务器链接：'

In [58]: soup.a.previous_sibling.previousSibling

In [59]: soup.a.next_sibling.previous_sibling
Out[59]: <a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a>
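
As the session shows, the sibling right after a tag is often a text node. When only tags are wanted, `find_next_sibling()` can skip over the text. A sketch on a simplified inline fragment (the shortened link texts are placeholders):

```python
from bs4 import BeautifulSoup

html = '<p class="links">intro<a id="link1">one</a> and <a id="link2">two</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is the text between the two links.
raw_next = first.next_sibling

# find_next_sibling("a") skips text nodes and returns the next <a> tag.
next_link = first.find_next_sibling("a")

# The kinds of nodes that follow the first link, in document order.
later_kinds = [type(s).__name__ for s in first.next_siblings]

print(repr(raw_next), next_link["id"], later_kinds)
```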

Pretty-Printing HTML with Beautiful Soup 4

Pretty-printing makes the output HTML easier to read. Use the prettify() method:

In [60]: r.text
Out[60]: '<!DOCTYPE html>\n<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title>VPS推荐</title></head><body><p class="title"><b>亲测速度很快</b></p><p class="links">下面是两个推荐的VPS服务器链接：<a href="https://m.do.co/c/fd128f8ba9e8" class="vps1" id="link1">Digital Ocean优惠链接</a> 和<a href="https://www.vultr.com/?ref=7147564" class="vps2" id="link2">Vultr优惠10美元链接</a>。</p></body></html>'

In [61]: soup.prettify()
Out[61]: '<!DOCTYPE html>\n<html>\n <head>\n  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n  <title>\n   VPS推荐\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    亲测速度很快\n   </b>\n  </p>\n  <p class="links">\n   下面是两个推荐的VPS服务器链接：\n   <a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">\n    Digital Ocean优惠链接\n   </a>\n   和\n   <a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">\n    Vultr优惠10美元链接\n   </a>\n   。\n  </p>\n </body>\n</html>'

In [62]: print(soup.prettify())
<!DOCTYPE html>
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   VPS推荐
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    亲测速度很快
   </b>
  </p>
  <p class="links">
   下面是两个推荐的VPS服务器链接：
   <a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">
    Digital Ocean优惠链接
   </a>
   和
   <a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">
    Vultr优惠10美元链接
   </a>
   。
  </p>
 </body>
</html>

Commonly Used Functions in Beautiful Soup 4

<tag>.find_all(name, attrs, recursive, string, **kwargs)

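Returns a list holding the search results.

name: the tag name to search for; it may be a list containing several names;

attrs: a search on tag attribute values; specific attributes can be named;

recursive: whether to search all descendants rather than only direct children; defaults to True;

string: a search on the string content between <>...</>;

**kwargs: keyword arguments that filter on attribute values, for example id='link1'.

Testing in IPython: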

In [63]: for link in soup.find_all('a'):
    ...:     print(link.get('href'))
    ...:
https://m.do.co/c/fd128f8ba9e8
https://www.vultr.com/?ref=7147564

In [64]: soup.find_all(['a','b'])
Out[64]:
[<b>亲测速度很快</b>,
 <a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a>,
 <a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>]

In [65]: for tag in soup.find_all(True):
    ...:     print(tag.name)
    ...:
html
head
meta
title
body
p
b
p
a
a

# Use a regular expression to print every tag name that contains "b" (use '^b' to match only names starting with b)
In [66]: import re

In [67]: for tag in soup.find_all(re.compile('b')):
    ...:     print(tag.name)
    ...:
body
b

# Find all <a> tags, returned as a list
In [72]: soup.find_all('a')
Out[72]:
[<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a>,
 <a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>]

# Find the <a> tags with id='link1', returned as a list
In [73]: soup.find_all('a',id='link1')
Out[73]: [<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a>]
# Find the <a> tags with id='link2', returned as a list
In [74]: soup.find_all('a',id='link2')
Out[74]: [<a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>]
# Find the <p> tags with id='link2'; no <p> tag has that id, so the list is empty
In [75]: soup.find_all('p',id='link2')
Out[75]: []

# Find all tags whose id contains "link" (use '^link' to require that it starts with link)
In [76]: soup.find_all(id=re.compile('link'))
Out[76]:
[<a class="vps1" href="https://m.do.co/c/fd128f8ba9e8" id="link1">Digital Ocean优惠链接</a>,
 <a class="vps2" href="https://www.vultr.com/?ref=7147564" id="link2">Vultr优惠10美元链接</a>]

# Find all strings that contain '优惠'
In [78]: soup.find_all(string = re.compile('优惠'))
Out[78]: ['Digital Ocean优惠链接', 'Vultr优惠10美元链接']
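
A few more `find_all()` arguments are worth trying on the same page: `class_` (with a trailing underscore, because `class` is a Python keyword), `limit`, the `attrs` dictionary, and `recursive=False`. The sketch below assumes an inlined copy of the demo page's body:

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="title"><b>亲测速度很快</b></p>'
    '<p class="links">下面是两个推荐的VPS服务器链接：'
    '<a href="https://m.do.co/c/fd128f8ba9e8" class="vps1" id="link1">Digital Ocean优惠链接</a> 和'
    '<a href="https://www.vultr.com/?ref=7147564" class="vps2" id="link2">Vultr优惠10美元链接</a>。</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Filter by CSS class.
by_class = soup.find_all("a", class_="vps2")

# Stop after the first match.
first_only = soup.find_all("a", limit=1)

# Filter with an attrs dictionary instead of keyword arguments.
by_attrs = soup.find_all(attrs={"id": "link1"})

# recursive=False only looks at direct children; the links sit inside <p>, so nothing matches.
direct_links = soup.body.find_all("a", recursive=False)

print([a["id"] for a in by_class], len(first_only), len(by_attrs), direct_links)
```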

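Other search methods in Beautiful Soup 4:

Function	Meaning	Parameters
<tag>.find()	Searches and returns a single result (a Tag, or None if nothing matches)	Same as find_all()
<tag>.find_parent()	Returns a single result from the ancestors	Same as find()
<tag>.find_parents()	Searches the ancestors and returns a list	Same as find_all()
<tag>.find_next_sibling()	Returns a single result from the following siblings	Same as find()
<tag>.find_next_siblings()	Searches the following siblings and returns a list	Same as find_all()
<tag>.find_previous_sibling()	Returns a single result from the preceding siblings	Same as find()
<tag>.find_previous_siblings()	Searches the preceding siblings and returns a list	Same as find_all()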
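
The single-result methods return None when nothing matches, so it is worth checking before reading attributes. A brief sketch on a simplified inline fragment (the `/one` and `/two` hrefs are placeholders):

```python
from bs4 import BeautifulSoup

html = (
    '<p class="links">intro'
    '<a href="/one" id="link1">one</a> and '
    '<a href="/two" id="link2">two</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", id="link1")     # a Tag
missing = soup.find("a", id="link3")  # None, because no such tag exists

# Guard against None before indexing into the result.
href = link["href"] if link is not None else None

parent_p = link.find_parent("p")
next_a = link.find_next_sibling("a")

print(href, missing, parent_p["class"], next_a["id"])
```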

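Additional Notes on Beautiful Soup 4

Summary of <tag>.string:
- If a tag has exactly one child and that child is a NavigableString (text), .string returns that child;
- If a tag has exactly one child and that child is another tag, .string returns the same value as the child's .string;
- If a tag contains several children, it is unclear which one .string should refer to, so the result is None.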
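Using <tag>.strings and <tag>.stripped_strings:
- If a tag contains several strings, loop over .strings to get them; the output may include many spaces or blank lines;
- .stripped_strings removes the extra whitespace: strings made up only of whitespace are skipped, and leading and trailing whitespace is stripped from the rest.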
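
The rules for .string, .strings, and .stripped_strings can be checked on the links paragraph of the demo page. This sketch assumes an inlined copy of that paragraph with its original line breaks and indentation (href attributes omitted for brevity):

```python
from bs4 import BeautifulSoup

html = """<p class="links">
    下面是两个推荐的VPS服务器链接：
    <a id="link1">Digital Ocean优惠链接</a> 和
    <a id="link2">Vultr优惠10美元链接</a>。
</p>"""
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Several children, so .string is None.
p_string = p.string

# A single text child, so .string is that text.
a_string = soup.a.string

# A single tag child: the parent takes over the child's .string.
nested = BeautifulSoup("<p><b>亲测速度很快</b></p>", "html.parser").p.string

raw = list(p.strings)             # keeps newlines and indentation
clean = list(p.stripped_strings)  # whitespace trimmed, blank strings dropped

print(p_string, a_string, nested)
print(raw)
print(clean)
```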
CSS selectors: Beautiful Soup 4 supports most CSS selectors; pass the selector string to the select() method. #link1 is an id selector and .sister is a class selector. The selectors below are written for the sample document in the official documentation, whose links carry class="sister" and point to example.com, not for the demo page used earlier in this lesson.
- Find tags by walking down through tag names:
  soup.select("body a")
- Find the direct children of a tag:
  soup.select("head > title")
  soup.select("p > #link1")
- Find sibling tags:
  # all following siblings
  soup.select("#link1 ~ .sister")
  # the next sibling
  soup.select("#link1 + .sister")
- Find by CSS class name:
  soup.select(".sister")
  soup.select("[class~=sister]")
- Find by tag id:
  soup.select("#link1")
  soup.select("a#link2")
- Find by whether an attribute exists:
  soup.select('a[href]')
- Find by attribute value:
  soup.select('a[href="http://example.com/elsie"]')
  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
  # attribute value starts with a given string
  soup.select('a[href^="http://example.com/"]')
  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  # attribute value ends with a given string
  soup.select('a[href$="tillie"]')
  # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  # attribute value contains a given string
  soup.select('a[href*=".com/el"]')
  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
<tag>.get_text(): if you only want the text inside a tag, call this method. It returns all the text in the tag, including the text in descendant tags, as a single Unicode string.
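
The same kinds of selectors can be applied to this lesson's own demo page, where the links use the classes vps1 and vps2. The sketch below assumes an inlined copy of the page and a bs4 version that bundles CSS selector support (select() and select_one()):

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>VPS推荐</title></head><body>"
    '<p class="title"><b>亲测速度很快</b></p>'
    '<p class="links">下面是两个推荐的VPS服务器链接：'
    '<a href="https://m.do.co/c/fd128f8ba9e8" class="vps1" id="link1">Digital Ocean优惠链接</a> 和'
    '<a href="https://www.vultr.com/?ref=7147564" class="vps2" id="link2">Vultr优惠10美元链接</a>。</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Descendant, child, sibling, and attribute-prefix selectors.
all_links = [a["id"] for a in soup.select("body a")]
child_link = [a["id"] for a in soup.select("p > #link1")]
later_sibling = [a["id"] for a in soup.select("#link1 ~ .vps2")]
vultr = [a["id"] for a in soup.select('a[href^="https://www.vultr.com"]')]

# select_one() returns the first match; get_text() joins all text inside it.
links_text = soup.select_one("p.links").get_text()

print(all_links, child_link, later_sibling, vultr)
print(links_text)
```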

Hands-On: Scraping the Douban Movie Top 250 with Beautiful Soup 4

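The Douban Movie Top 250 page is at https://movie.douban.com/top250?start=0. The page has page-number navigation at the bottom, so the strategy for reaching the other pages is to change the value in `start=0`; each page holds 25 entries.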
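
With 250 entries at 25 per page, the offsets run from 0 to 225. A small sketch of generating the page URLs (the constant names are illustrative):

```python
BASE_URL = "https://movie.douban.com/top250?start="
PAGE_SIZE = 25  # entries per page
TOTAL = 250     # entries in the whole list

# One URL per page: start=0, start=25, ..., start=225.
page_urls = [BASE_URL + str(offset) for offset in range(0, TOTAL, PAGE_SIZE)]

for url in page_urls:
    print(url)
```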

The code for this exercise is as follows:

import json

import requests
from bs4 import BeautifulSoup

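### Room for improvement:
### The data here comes only from the list pages, so it is incomplete;
### Homework: after getting each movie's detail URL, visit it and scrape more information from the detail page.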

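def get_html(url):
    """
    Fetch the source of a web page
    :param url: URL to request
    :return: the page source, or an empty string if the request fails
    """
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        print("An exception occurred: {}".format(str(e)))
        # Return an empty document so the parser gets a string instead of None
        return ""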


def parse_movie_by_bs4(html):
    """
    Process the page source and extract movie information with Beautiful Soup 4
    :param html: page source
    :return: list of movie information for the current page
    """
    # List that stores the extracted data; each item is a dict
    page_movies = []

    # Process the data

    # 1. Hand the HTML document to Beautiful Soup
    soup = BeautifulSoup(html, "lxml")
    # 2. Find all div nodes with class='info'
    div_infos = soup.find_all('div', {'class': 'info'})
    for div_info in div_infos:
        # Store each movie in a dict
        movie = {}

        # 3. Find the node with class="hd"
        div_hd = div_info.find('div', {'class': 'hd'})
        # 4. Find the node with class="bd"
        div_bd = div_info.find('div', {'class': 'bd'})

        # 5. Take the url and title from div_hd
        movie['url'] = div_hd.a.attrs['href']
        title = ''
        for span in div_hd.a.contents:
            title += str(span.string)
        # Remove all whitespace from the joined title
        movie['title'] = ''.join(title.split())

        # 6. From div_bd, take the director, lead actors, year, country, genre, and quote
        p1 = div_bd.find('p', {'class': ''})
        movie['info'] = ",".join(p1.get_text(',', strip=True).split()).replace(":", ",").replace(",,", ",").replace("/,", "")

        div_star = div_bd.find('div')
        movie['rating_num'] = div_star.find_all('span')[1].string
        movie['valuation_num'] = str(div_star.find_all('span')[3].string).replace("人评价", "")

        # Some movies have no quote, so check for None
        p2 = div_bd.find('p', {'class': 'quote'})
        if p2 is not None:
            movie['quote'] = p2.span.string
        else:
            movie['quote'] = ""

        page_movies.append(movie)

    return page_movies


def main():
    """
    Build each page URL, crawl every page in turn, and collect all movie information
    """
    base_url = 'https://movie.douban.com/top250?start='
    offset = 0
    all_movies = []
    page_id = 0
    while offset < 250:
        print("Crawling page {}".format(page_id + 1))
        url = base_url + str(offset)
        html_text = get_html(url)
        page_movies = parse_movie_by_bs4(html_text)
        # Stores one list of movies per page
        all_movies.append(page_movies)
        offset += 25
        page_id += 1
    # Write UTF-8 explicitly, since ensure_ascii=False keeps the Chinese text as-is
    with open("douban_250.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(all_movies, ensure_ascii=False))
    print("All data crawled")


if __name__ == '__main__':
    main()
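
Because main() appends one list per page, the saved JSON is a list of pages rather than a flat list of movies. If a flat list is preferred, the pages can be chained together before saving. A sketch with placeholder data (the titles "A", "B", and "C" are made up for illustration):

```python
import json
from itertools import chain

# Placeholder for what all_movies looks like: one inner list per page.
pages = [
    [{"title": "A"}, {"title": "B"}],
    [{"title": "C"}],
]

# Flatten the per-page lists into a single list of movie dicts.
flat_movies = list(chain.from_iterable(pages))

# ensure_ascii=False keeps non-ASCII text readable in the output.
flat_json = json.dumps(flat_movies, ensure_ascii=False)
print(flat_json)
```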

Recommended References

Official documentation: Beautiful Soup 4.2.0 Documentation

Originally published: 2018/09/14