Python Web Crawler Study Notes (3)

Overview:

Beautiful Soup Quick Start

Beautiful Soup Navigating the tree

Official reference documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Sources:

莫烦PYTHON

The official documentation

Beijing Institute of Technology Python MOOC slides

Quick Start

We know that a web page (HTML) has a basic structure, and scraping a page means locating the information you need within that structure. BeautifulSoup is a great helper for this: it finds the information you want quickly and accurately, and it greatly simplifies the work.

Here is how the official documentation puts it:

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com"", or "Find the table heading that's got bold text, then give me that text."

Notice how the official documentation describes the library's convenience: it uses the present perfect tense, suggesting an effect that keeps on giving!

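As a minimal quick-start sketch, assuming the standard "three sisters" example HTML from the official documentation:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

# Build the parse tree with Python's built-in parser.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)                 # the text inside <title>
links = soup.find_all("a")               # every <a> tag in the document
print([a.get("href") for a in links])    # their href attributes
```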

Completing the small experiment above should give you a basic grasp of what BeautifulSoup does.

The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters. I gave several examples in Kinds of filters, but here are a few more:

soup.find_all("title")

# [The Dormouse's story]

soup.find_all("p","title")

# [

The Dormouse's story

]

soup.find_all("a")

# [

Elsie

,

# Lacie,

# Tillie]

soup.find_all(id="link2")

# [Lacie]

importresoup.find(string=re.compile("sisters"))

# u'Once upon a time there were three little sisters; and their names were\n'

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

```python
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)

soup.get_text()
# u'\nI linked to example.com\n'

soup.i.get_text()
# u'example.com'
```

You can specify a string to be used to join the bits of text together:

# soup.get_text("|")

u'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

# soup.get_text("|", strip=True)

u'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

```python
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
```

Below are some tips on the code above, taken from Professor Song Tian's slides (Beijing Institute of Technology).

The following slide shows that we can also use BeautifulSoup to parse a local file directly, as well as how the pieces relate to one another.
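Parsing a local file works the same way: pass an open file object (or its contents) to the constructor instead of a string. A minimal sketch; the file name `demo.html` is hypothetical:

```python
from bs4 import BeautifulSoup

# Create a small local file to stand in for any saved page.
with open("demo.html", "w", encoding="utf-8") as f:
    f.write("<html><head><title>Demo</title></head>"
            "<body><p>hello</p></body></html>")

# Pass the open file object (or f.read()) to BeautifulSoup.
with open("demo.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.title.string)   # text of the <title> tag
print(soup.p.string)       # text of the <p> tag
```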

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
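The four kinds of objects can be seen in a tiny document (a sketch; the markup is made up for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString, Comment

soup = BeautifulSoup("<b class='x'>bold<!-- a comment --></b>", "html.parser")

tag = soup.b
print(type(tag))              # Tag: the <b> element itself
print(type(tag.contents[0]))  # NavigableString: the text "bold"
print(type(tag.contents[1]))  # Comment: the HTML comment
print(type(soup))             # BeautifulSoup: the document as a whole
```

Note that Comment is a subclass of NavigableString, so a comment also behaves like a string of text.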

If you can, I recommend you install and use lxml for speed.

Tags have a lot of attributes and methods, and I'll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

Using a tag name as an attribute will give you only the first tag by that name:

```python
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```

If you need to get all the tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as find_all():

```python
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

Every tag has a name, accessible as .name.

Pass in a value for name and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.

This is the simplest usage:

soup.find_all("title")# [The Dormouse's story]

Recall from Kinds of filters that the value of name can be a string, a regular expression, a list, a function, or the value True.
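All five kinds of filters can be shown against one small document (a sketch; the markup is invented for illustration):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><a id='l1'>one</a><b>two</b><p>three</p></body></html>",
    "html.parser")

# A string matches one tag name exactly.
soup.find_all("a")
# A regular expression is matched against every tag name.
soup.find_all(re.compile("^b"))               # <body> and <b>
# A list matches any tag name it contains.
soup.find_all(["a", "p"])
# A function receives each tag and returns True to keep it.
soup.find_all(lambda tag: tag.has_attr("id"))
# True matches every tag (but no text strings).
soup.find_all(True)
```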

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary.

You can access that dictionary directly as .attrs.

If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute.
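The three ways of working with attributes described above can be sketched together (a minimal example using the docs' <b id="boldest"> tag):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">extremely bold</b>', "html.parser")
tag = soup.b

# Treat the tag like a dictionary to read one attribute.
print(tag["id"])                    # "boldest"
# .attrs is the whole attribute dictionary.
print(tag.attrs)                    # {'id': 'boldest'}
# Passing id= to find_all() filters on each tag's 'id' attribute.
print(soup.find_all(id="boldest"))
```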

Original link: https://kuaibao.qq.com/s/20180824G01WPC00?refer=cp_1026