Web Scraping | Parsing Data with the Beautiful Soup Module

数据STUDIO

Published 2021-06-24 10:29:44

This article is included in the column: 数据STUDIO

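Besides XPath, Beautiful Soup is another commonly used module for extracting data from HTML files. Its search and extraction features are powerful and convenient, and it provides simple functions for navigating, searching, and modifying the parse tree. Beautiful Soup is a Python HTML parsing library that uses the structure and attributes of a page to locate content, which is usually simpler and more effective than regular expressions. It automatically converts input documents to Unicode and output documents to UTF-8.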
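
The encoding behavior described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical GBK-encoded page and the built-in html.parser; the from_encoding argument is only needed when you already know the source encoding.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a small page stored as GBK-encoded bytes.
raw = "<html><head><title>你好</title></head><body></body></html>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

title = soup.title.string   # text inside the tree is a Unicode str
encoded = soup.encode()     # serializing with encode() yields UTF-8 bytes by default

print(title)
print(isinstance(title, str), isinstance(encoded, bytes))
```

Inside the tree everything is Unicode, so the source encoding only matters at the moment of parsing and serializing.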

Parsers

Parsers supported by Beautiful Soup

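Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in Python versions before 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most tolerant; parses pages the way a browser does; produces HTML5-format documents	Slow; must be installed separately (pure Python, no C extension required)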
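
One way to act on this table is to try a faster parser first and fall back to the built-in one. The sketch below assumes a hypothetical helper named make_soup and an arbitrary preference order; neither is part of Beautiful Soup itself.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper written for this example.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately unclosed tags: both parsers close them automatically.
soup, used = make_soup("<p>Hello<b>world")
print(used)           # name of the parser that was actually used
print(soup.b.string)
```

Different parsers can build different trees from the same malformed markup, so it is usually safer to pin one parser per project once you know what is installed.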

Usage

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie,</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
""" 
soup = BeautifulSoup(html, features='lxml')  # parse the HTML and build the soup object
print(soup.prettify())  # output as a string with standard indentation; missing closing tags are added automatically

Output

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie,
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

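bs4 node selectors

Direct access

Access a node directly by its tag name, then read its string attribute to get the text inside it. When the structure around a node is simple and clear, this is a very quick way to extract its information.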

>>> soup.head            # select the head node
<head><title>The Dormouse's story</title></head>

>>> soup.title           # select the title node
<title>The Dormouse's story</title>

>>> soup.title.string    # get the text inside the node
"The Dormouse's story"

>>> soup.p               # returns the first matching node by default
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

>>> soup.p.name          # name of the node
'p'
>>> soup.title.name      # name of the node
'title'

>>> soup.p.attrs         # all attributes of the node
{'class': ['title'], 'name': 'dromouse'}
>>> soup.p.attrs['name'] # get one attribute value
'dromouse'
>>> soup.p['name']       # same value; "attrs" can be omitted
'dromouse'

>>> soup.body.p.b        # chain with "." to select nested nodes more precisely
<b>The Dormouse's story</b>
>>> type(soup.body.p.b)
<class 'bs4.element.Tag'>
# Both head and the title node inside it are of type bs4.element.Tag,
# so child nodes can be selected from any Tag. This is what makes nested selection work.

Note that soup.p returns only the first p node. Even if the document contains several p nodes, all the others are ignored.
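
A small sketch of that behavior, using a shortened, hypothetical version of the sample document and the built-in html.parser. It also shows that string returns None for a node with several children, where get_text() still works.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a id="link1">Elsie</a> lived in a well.</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.p               # attribute access stops at the first p node
all_p = soup.find_all("p")   # find_all returns every p node in document order

print(first is soup.find("p"))   # attribute access and find() reach the same node
print(len(all_p))
print(all_p[1].string)           # None, because this p has several children
print(all_p[1].get_text())       # get_text() joins all text inside the node
```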

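Navigating to related nodes

First locate a node, then use it as the starting point to reach its children, descendants, parents, and siblings.

Getting child nodes

Use the contents or children attribute.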

>>> soup.head.contents
[<title>The Dormouse's story</title>]
# A list in which each element is one child node
>>> soup.head.children
<list_iterator at 0x7fcb2a242e50>
# An iterator over all direct children of head
>>> [i for i in soup.head.children]
[<title>The Dormouse's story</title>]
>>> list(soup.head.children)
[<title>The Dormouse's story</title>]
# Either form converts the iterator to a list
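
When the markup is indented, children also yields the whitespace text nodes between tags. A minimal sketch, assuming a hypothetical list fragment and the built-in html.parser, that keeps only the Tag children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with indentation between the tags.
html = """<ul>
  <li>one</li>
  <li>two</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

raw_children = list(soup.ul.children)  # includes the whitespace strings
tag_children = [c for c in soup.ul.children if isinstance(c, Tag)]

print(len(raw_children))                 # whitespace strings are counted too
print([t.string for t in tag_children])  # only the li nodes remain
```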

Getting descendant nodes

Use the descendants attribute.

>>> html_str = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
</body>
</html>
"""
>>> soup = BeautifulSoup(html_str,'lxml')  # parse the HTML and build the soup object
>>> soup.body.descendants
<generator object Tag.descendants at 0x7fcb2a264750>
# A generator over all descendant nodes
>>> for i in soup.body.descendants:
...     print(i)
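
To make the difference between children and descendants concrete, here is a small sketch on a compact, hypothetical version of the same markup (no indentation, parsed with html.parser) that separates descendant tags from descendant text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact, hypothetical version of the markup above.
html = "<body><p class='title'><b>The Dormouse's story</b></p></body>"
soup = BeautifulSoup(html, "html.parser")

child_names = [c.name for c in soup.body.children]   # direct children only
descendants = list(soup.body.descendants)            # every level below body

tags = [d.name for d in descendants if isinstance(d, Tag)]
texts = [str(d) for d in descendants if not isinstance(d, Tag)]

print(child_names)
print(tags)
print(texts)
```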

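Getting parent nodes

The parent attribute returns the direct parent of a node. The parents attribute returns a generator that yields the parent and then every further ancestor.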

>>> soup.title.parent
<head>
<title>
   The Dormouse's story
  </title>
</head>
>>> soup.title.parents
<generator object PageElement.parents at 0x7fcb2a264ed0>

# Loop over the generator to visit the parent and every ancestor
>>> for i in soup.title.parents:
...     print(i.name)  # print the name of the parent and of each ancestor
...     print('-'*20)
head
--------------------
html
--------------------
[document]
--------------------
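
The ancestor chain can be turned into a readable path. This is a minimal sketch on a compact, hypothetical document parsed with html.parser; it drops the final "[document]" entry, which is the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# parents yields the nearest ancestor first and ends with "[document]".
names = [p.name for p in soup.title.parents]

# Reverse the real ancestors and append the node itself to get a root-to-node path.
path = " > ".join(list(reversed(names[:-1])) + [soup.title.name])

print(names)
print(path)
print(soup.title.find_parent("html").name)  # find_parent() searches the same chain by name
```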

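Getting sibling (same-level) nodes

The next_sibling attribute returns the sibling immediately after the current node; the previous_sibling attribute returns the sibling immediately before it.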

>>> html_str = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story"></p>
   Once upon a time there were three little sisters; and their names were
</body>
</html>
"""
>>> soup = BeautifulSoup(html_str,'lxml')  # parse the HTML and build the soup object
>>> soup.p
<p class="title" name="dromouse">
<b>
    The Dormouse's story
   </b>
</p>
# If two nodes are separated by a newline (\n), whitespace, or other text, that text node is returned.
>>> soup.p.next_sibling 
'\n'

>>> p = soup.p.next_sibling.next_sibling
>>> p
<p class="story"></p>

>>> p.previous_sibling
'\n'

# next_siblings yields every sibling after the current node
# previous_siblings yields every sibling before the current node
>>> soup.p.next_siblings
<generator object PageElement.next_siblings at 0x7fcb2c4cae50>
# Returns an iterable (a generator)
>>> for i in soup.p.next_siblings:
...     print(i)
...     print('-'*20)
--------------------
<p class="story"></p>
--------------------

   Once upon a time there were three little sisters; and their names were

--------------------
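
Because whitespace counts as a sibling, chaining next_sibling twice is fragile. A minimal sketch, assuming a hypothetical two-paragraph fragment and html.parser, that uses find_next_sibling() and find_previous_sibling() to skip text nodes:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation between the two p nodes.
html = """<body>
  <p class="title">The Dormouse's story</p>
  <p class="story">Once upon a time</p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.p
gap = first.next_sibling                    # the whitespace text node between the tags
second = first.find_next_sibling("p")       # skips text nodes, returns the next p tag
back = second.find_previous_sibling("p")    # and the reverse direction

print(repr(gap))
print(second["class"])
print(back is first)
```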

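Method selectors

find(): get the first matching node

soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)

name: the node name to match; the first matching node is returned.
attrs: attributes to filter on, given either as a dict or as keyword arguments.
text: matches the text inside nodes; accepts a string or a compiled regular expression. In Beautiful Soup 4.4.0 and later this argument is also named string.

find_all(): get all matching nodes

soup.find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

name: the node name to match; returns a list-like ResultSet in which each element is one matching node.
attrs: attributes to filter on, given either as a dict or as keyword arguments.
text: matches the text inside nodes; accepts a string or a compiled regular expression. In Beautiful Soup 4.4.0 and later this argument is also named string.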
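
The sketch below exercises the less obvious arguments: recursive and limit, plus the fact that find() returns None when nothing matches. The markup is hypothetical and is parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two direct p children and one nested inside a section.
html = '<div id="box"><p>one</p><section><p>two</p></section><p>three</p></div>'
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", attrs={"id": "box"})

missing = soup.find("table")                    # no match: find() returns None
all_p = box.find_all("p")                       # searches every level below box
direct_p = box.find_all("p", recursive=False)   # direct children only
first_two = box.find_all("p", limit=2)          # stop after two matches

print(missing)
print([p.string for p in all_p])
print([p.string for p in direct_p])
print([p.string for p in first_two])
```

Checking the result of find() for None before using it avoids an AttributeError on pages where the expected node is absent.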

The name parameter

>>> from bs4 import BeautifulSoup
>>> html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie,</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
""" 
# Parse the HTML and build the soup object
>>> soup = BeautifulSoup(html,'lxml')

# Find by node name: first occurrence
>>> soup.find(name='p')
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

# Find by node name: all occurrences
>>> soup.find_all(name='p')  # name= can be omitted: soup.find_all('p')
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie,</a>
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
# Check the type of the result
>>> type(soup.find_all(name='p'))
<class 'bs4.element.ResultSet'>

# A ResultSet can be indexed and sliced like a Python list
>>> soup.find_all(name='p')[0]
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

# Check the type of one element
>>> type(soup.find_all(name='p')[0])
<class 'bs4.element.Tag'>

# Every element of a ResultSet is a bs4.element.Tag,
# so a search can be nested on any element.
# Get all a nodes inside the second p node
>>> soup.find_all(name='p')[1].find_all(name='a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie,</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The attrs parameter

# Pass attributes as keyword arguments
# Find by id: the node whose id is link1
>>> soup.find_all(id='link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie,</a>]

# Find by class name: nodes whose class is sister
>>> soup.find_all(class_='sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie,</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Pass attributes as a dict
>>> soup.find_all(name='p',attrs={'class':'story'})  # combining name and attrs narrows the match
[<p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie,</a>
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 and they lived at the bottom of a well.</p>, <p class="story">...</p>]

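# text returns all text inside the node as one string
>>> soup.find_all(id='link1')[0].text
'Elsie,'
# string returns the node's single text child (None if the node has several children)
>>> soup.find_all(id='link1')[0].string
'Elsie,'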

Note: because class is a reserved word in Python, write class_ when passing the class attribute as a keyword argument. When passing attributes as a dict, use 'class' directly.
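
A short sketch of the two spellings side by side, on hypothetical markup parsed with html.parser. It also shows that a node with several classes still matches when you search for just one of them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link carries two classes.
html = '<a class="sister big" id="link1">Elsie</a><a class="brother" id="link2">Tom</a>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(class_="sister")        # keyword form needs the underscore
by_dict = soup.find_all(attrs={"class": "sister"}) # dict form uses the plain name

print([a["id"] for a in by_keyword])
print([a["id"] for a in by_dict])
print(soup.a["class"])   # class is multi-valued, so it is returned as a list
```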

The text parameter

>>> soup.find_all(text='Lacie')
['Lacie']

>>> import re
>>> soup.find(text=re.compile('e'))
"The Dormouse's story"

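Text searches return the matching strings themselves, not the tags that contain them. The sketch below uses hypothetical markup and html.parser to show how to get back to the tags. It uses string=, the name this argument has in Beautiful Soup 4.4.0 and later; on older versions text= would be needed instead.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for this example.
html = ('<p>Sisters: <a id="link1">Elsie</a>, <a id="link2">Lacie</a> '
        'and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

# Matching strings only: each result is a text node.
names = soup.find_all(string=re.compile("ie$"))
# Each text node knows its parent tag.
ids = [s.parent["id"] for s in names]
# Combining a tag name with string= returns the tags instead of the text.
lacie = soup.find_all("a", string="Lacie")

print([str(s) for s in names])
print(ids)
print([a["id"] for a in lacie])
```
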
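CSS selectors

Call select() on a Tag or BeautifulSoup object with a CSS selector string to get the matching nodes.

See https://www.w3school.com.cn/sccref/css_selectors.asp for a reference.

tag name: select by node name, written directly as a string
.class: select by class attribute value
#id: select by id attribute value

The select_one() method returns the first of the matching nodes.

# Select by node name: all matches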
>>> soup.select('title') 
[<title>The Dormouse's story</title>]
 
>>> soup.select('p')
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie,</a>
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
 
# Select by node name, then index to take the first (index 0) p node
>>> soup.select('p')[0]
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

# Select by class: all nodes whose class is sister
>>> soup.select('.sister') 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie,</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Select by id: the node whose id is link1
>>> soup.select('#link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie,</a>]

# Combine selectors with a space (descendant combinator) for a more precise match
>>> soup.select('p #link2')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

>>> soup.select('p .sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie,</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
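
The examples above all use select(). For completeness, here is a minimal sketch of select_one() and of the child combinator, on hypothetical markup parsed with html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this example.
html = """
<p class="story">
  <a class="sister" id="link1">Elsie</a>
  <a class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one(".sister")      # a single Tag: the first match
missing = soup.select_one(".brother")   # None when nothing matches
as_list = soup.select(".sister")        # select() always returns a list
direct = soup.select("p > a")           # ">" matches direct children only

print(first["id"])
print(missing)
print(len(as_list))
print([a["id"] for a in direct])
```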

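Other ways to get nodes by condition

Selector expression	Description
soup.select('div[class="title"]')[0].select('p')[0]	Nested selection: the first p node inside the first div whose class is title
soup.select('p')[0]['value'] or soup.select('p')[0].attrs['value']	The value attribute of the first p node (two equivalent forms)
soup.select('p')[0].get_text() or soup.select('p')[0].string	The text inside the first p node (two forms; string is None when the node has more than one child)
soup.select('p')[1:]	All p nodes from the second one onward
soup.select('.sister, .brother')	Nodes whose class is sister or brother
soup.select('a[href]')	All a nodes that have an href attribute
soup.select('p[value="orange"]')	All p nodes whose value attribute equals orange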
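
The rows of this table can be tried out on a small document. The markup below is hypothetical and was written only so that every expression has something to match; it is parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering each row of the table.
html = """
<div class="title"><p value="orange">first</p><p value="apple">second</p></div>
<a class="sister" href="http://example.com/elsie">Elsie</a>
<a class="brother">Tom</a>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.select('div[class="title"]')[0].select("p")[0]
value = first_p["value"]                   # same as first_p.attrs["value"]
text = first_p.get_text()                  # text inside the first p node
rest = soup.select("p")[1:]                # p nodes from the second onward
both = soup.select(".sister, .brother")    # either class
with_href = soup.select("a[href]")         # a nodes that have an href attribute
orange = soup.select('p[value="orange"]')  # p nodes with value="orange"

print(value, text)
print(len(rest), len(both), len(with_href), len(orange))
```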