Beautiful Soup Quick Start
Beautiful Soup Navigating the tree
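A web page (HTML) has a basic structure, and scraping a page means finding the information you need inside that structure. BeautifulSoup is a handy tool for that job: it helps you locate information quickly and accurately, and makes scraping much easier.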
Beautiful Soup parses anything you give it and does the tree traversal for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose URLs match foo.com", or "Find the table heading that's got bold text, then give me that text."
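The four requests above can be sketched roughly as follows. The markup is hypothetical and invented only for this illustration, and the built-in html.parser is assumed so that no extra parser is required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this sketch.
html = """
<table><tr><th><b>Price</b></th><th>Name</th></tr></table>
<a href="/about">About</a>
<a class="externalLink" href="http://foo.com/page">Foo</a>
<a class="externalLink" href="http://bar.org/">Bar</a>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links of class externalLink"
external_links = soup.find_all("a", class_="externalLink")

# "Find all the links whose URLs match foo.com"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that's got bold text, then give me that text"
bold_heading = soup.find(lambda tag: tag.name == "th" and tag.find("b") is not None)
heading_text = bold_heading.get_text()

print(len(all_links), len(external_links), len(foo_links), heading_text)
```

Each call takes a filter (a tag name, a CSS class, a regular expression, or a function) and leaves the traversal to Beautiful Soup.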
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters. I gave several examples in Kinds of filters, but here are a few more:
soup.find_all("title")
# [<title>The Dormouse's story</title>]
soup.find(string=re.compile("sisters"))  # requires "import re"
# u'Once upon a time there were three little sisters; and their names were\n'
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
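markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.get_text()
# u'\nI linked to example.com\n'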
You can specify a string to be used to join the bits of text together:
soup.get_text("|")
# u'\nI linked to |example.com|\n'
You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:
soup.get_text("|", strip=True)
# u'I linked to|example.com'
But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
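The text-extraction variants described above can be tried side by side. This is a minimal sketch that assumes the markup is a link wrapping an italicised domain name and that the built-in html.parser is used; under Python 3 the results are ordinary str values, so no u prefix is shown.

```python
from bs4 import BeautifulSoup

# Assumed markup: a link that wraps an italicised domain name.
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                    # all text, concatenated
joined = soup.get_text("|")                # pieces joined with a separator
stripped = soup.get_text("|", strip=True)  # each piece stripped, empty pieces dropped
pieces = list(soup.stripped_strings)       # the stripped pieces as a list

print(repr(plain))
print(repr(joined))
print(repr(stripped))
print(pieces)
```

Working with stripped_strings is usually the more flexible choice when the pieces need further processing before they are joined.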
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
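To make the object kinds concrete, the sketch below parses a tiny, made-up snippet and checks the class of each node. In the bs4 package these kinds are the classes Tag, NavigableString, BeautifulSoup, and Comment; the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up snippet containing a tag, a comment, and a text string.
soup = BeautifulSoup("<b><!--a comment-->bold text</b>", "html.parser")

tag = soup.b               # a Tag
comment = tag.contents[0]  # a Comment (a special kind of NavigableString)
text = tag.contents[1]     # a plain NavigableString

print(type(soup).__name__, type(tag).__name__,
      type(comment).__name__, type(text).__name__)
```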
If you can, I recommend you install and use lxml for speed.
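One possible way to follow this advice without breaking on machines where lxml is missing is to fall back to the built-in parser. The helper name make_soup is hypothetical, chosen only for this sketch.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.get_text())
```

Note that different parsers may build slightly different trees from invalid markup, so it can be worth settling on one parser for a given project.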
Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.
Using a tag name as an attribute, as in soup.a, will give you only the first tag by that name.
If you need to get all the tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all().
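The difference between attribute-style access and find_all() can be seen in a short sketch. The markup here is an assumed sample with three links, and the built-in html.parser is used.

```python
from bs4 import BeautifulSoup

# Assumed sample markup with three <a> tags.
html = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> '
        'and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.a              # only the first <a> tag
every = soup.find_all("a")  # a list of all <a> tags

print(first["id"])
print([a.get_text() for a in every])
```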
Every tag has a name, accessible as .name.
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don’t match.
This is the simplest usage:
soup.find_all("title")
# [<title>The Dormouse's story</title>]
Recall from Kinds of filters that the value to name can be a string, a regular expression, a list, a function, or the value True.
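As a rough illustration, each of the five filter kinds can be passed as the first argument of find_all(). The markup is a made-up sample and the built-in html.parser is assumed; the lists collect the names of the matched tags in document order.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample document.
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p><b>bold</b> and <a href='#'>link</a></p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# A string: exact tag name.
by_string = [t.name for t in soup.find_all("title")]
# A regular expression: tag names starting with "b".
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
# A list: any of the listed names.
by_list = [t.name for t in soup.find_all(["a", "b"])]
# A function: called with each tag, keeps those for which it returns True.
by_function = [t.name for t in soup.find_all(lambda tag: tag.has_attr("href"))]
# The value True: every tag, but no text strings.
by_true = [t.name for t in soup.find_all(True)]

print(by_string, by_regex, by_list, by_function, by_true)
```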
A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary, for example tag['id'].
You can access that dictionary directly as .attrs.
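A short sketch of name and attribute access, assuming a single bold tag with an id attribute and the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">bold</b>', "html.parser")
tag = soup.b

name = tag.name             # the tag's name
tag_id = tag["id"]          # dictionary-style attribute access
attrs = tag.attrs           # the underlying attribute dictionary
missing = tag.get("class")  # .get() returns None for an absent attribute

print(name, tag_id, attrs, missing)
```

Using tag.get() avoids the KeyError that tag["class"] would raise when the attribute is not present.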
If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute.
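Filtering on the id attribute might look like the sketch below. The markup is an assumed sample, and the built-in html.parser is used.

```python
from bs4 import BeautifulSoup

# Assumed sample markup.
html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><p id="intro">text</p>'
soup = BeautifulSoup(html, "html.parser")

# Tags whose id attribute equals "link2".
matches = soup.find_all(id="link2")

# Tags that have an id attribute at all, whatever its value.
with_id = soup.find_all(id=True)

print([t.get_text() for t in matches])
print([t.name for t in with_id])
```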