Essentials of BeautifulSoup4, a Python Web-Scraping Library

BeautifulSoup is an excellent Python extension library for extracting the data you care about from HTML or XML documents, and it lets you choose among several parsers. Since BeautifulSoup 3 is no longer maintained, new projects should use BeautifulSoup 4; the latest version at the time of writing is 4.5.0. Install it with pip install beautifulsoup4, then import it with from bs4 import BeautifulSoup. Below is a quick tour of BeautifulSoup4's main features; for complete documentation, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

>>> from bs4 import BeautifulSoup

>>> BeautifulSoup('hello world!', 'lxml') #missing tags are added and completed automatically

<html><body><p>hello world!</p></body></html>

>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

>>> soup = BeautifulSoup(html_doc, 'html.parser') #lxml or another parser could be used instead

>>> print(soup.prettify()) #display the document nicely indented

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

>>> soup.title #access a specific tag

<title>The Dormouse's story</title>

>>> soup.title.name #the tag's name

'title'

>>> soup.title.text #the tag's text

"The Dormouse's story"

>>> soup.title.string

"The Dormouse's story"

>>> soup.title.parent #the enclosing tag

<head><title>The Dormouse's story</title></head>

>>> soup.head

<head><title>The Dormouse's story</title></head>

>>> soup.b

<b>The Dormouse's story</b>

>>> soup.body.b

<b>The Dormouse's story</b>

>>> soup.name #the whole BeautifulSoup object can be treated as a tag

'[document]'

>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

>>> soup.p

<p class="title"><b>The Dormouse's story</b></p>

>>> soup.p['class'] #a tag attribute

['title']

>>> soup.p.get('class') #another way to read a tag attribute

['title']

>>> soup.p.text

"The Dormouse's story"

>>> soup.p.contents

[<b>The Dormouse's story</b>]

>>> soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

>>> soup.a.attrs #all attributes of the tag

{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}

>>> soup.find_all('a') #find all <a> tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> soup.find_all(['a', 'b']) #find <a> and <b> tags at the same time

[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> import re

>>> soup.find_all(href=re.compile("elsie")) #find tags whose href matches a pattern

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

>>> soup.find(id='link3')

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

>>> soup.find_all('a', id='link3')

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

>>> for link in soup.find_all('a'):
    print(link.text, ':', link.get('href'))

Elsie : http://example.com/elsie

Lacie : http://example.com/lacie

Tillie : http://example.com/tillie
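In addition to find_all(), BeautifulSoup supports CSS selectors through select(), which often reads more naturally when the target is described by class, id, or nesting. A minimal sketch against the same sample document (the selector strings are illustrative):

```python
from bs4 import BeautifulSoup

# Same sample document as above
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# select() takes a CSS selector string and returns a list of matching tags
sisters = soup.select('p.story a.sister')
print([a['id'] for a in sisters])      # ['link1', 'link2', 'link3']

# '#' selects by id, exactly as in a stylesheet
print(soup.select('#link2')[0].text)   # Lacie
```

select() always returns a list, even for an id selector that can match at most one tag, so indexing or iterating is up to the caller.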

>>> print(soup.get_text()) #return all text in the document

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie,

Lacie and

Tillie;

and they lived at the bottom of a well.

...

>>> soup.a['id'] = 'test_link1' #modify the value of a tag attribute

>>> soup.a

<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>

>>> soup.a.string.replace_with('test_Elsie') #modify the tag's text

'Elsie'

>>> soup.a.string

'test_Elsie'

>>> print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="test_link1">
    test_Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
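Beyond editing attributes and strings in place, the tree itself can be restructured. A small sketch, using a throwaway document and a hypothetical URL, with new_tag() to create a tag, append() to attach it, and decompose() to delete it again:

```python
from bs4 import BeautifulSoup

s = BeautifulSoup('<p class="story">...</p>', 'html.parser')

# new_tag() creates a tag that is not yet attached anywhere
new_a = s.new_tag('a', href='http://example.com/new')  # URL is made up for the demo
new_a.string = 'New link'

# append() attaches it as the last child of <p>
s.p.append(new_a)
print(s.p)  # <p class="story">...<a href="http://example.com/new">New link</a></p>

# decompose() removes a tag and everything inside it from the tree
s.p.a.decompose()
print(s.p)  # <p class="story">...</p>
```

After decompose(), s.p.a is None again: the tag is gone from the tree, not merely hidden.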

>>> for child in soup.body.children: #iterate over direct children
    print(child)

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

>>> for string in soup.strings: #iterate over all text nodes (output omitted)
    print(string)

>>> test_doc = '<html><head></head><body><p></p><p></p></body></html>'

>>> s = BeautifulSoup(test_doc, 'lxml')

>>> for child in s.html.children: #iterate over direct children
    print(child)

<head></head>
<body><p></p><p></p></body>

>>> for child in s.html.descendants: #iterate over all descendants, depth-first
    print(child)

<head></head>
<body><p></p><p></p></body>
<p></p>
<p></p>
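Besides children and descendants, a tag can also be navigated sideways and upwards. A minimal sketch with next_sibling, previous_sibling, and parent; the markup here deliberately has no whitespace between tags, so the siblings are tags rather than text nodes:

```python
from bs4 import BeautifulSoup

s = BeautifulSoup('<html><head></head><body><p id="a"></p><p id="b"></p></body></html>',
                  'html.parser')

first = s.body.p                    # <p id="a">
print(first.next_sibling['id'])     # 'b' -- the next tag at the same level
print(first.next_sibling.previous_sibling['id'])  # 'a' -- back again
print(s.body.parent.name)           # 'html' -- one level up
```

In a document with whitespace between tags, next_sibling would often be a newline text node first; next_element and the find_next_sibling() method help skip past those.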

Originally published on the WeChat public account Python小屋 (Python_xiaowu).

Original publication date: 2016-12-29.
