image.png
HTML通过预定义的<>…</>标签形式组织不同类型的信息
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
image.png
从标记后的信息中提取所关注的内容
提取HTML中所有URL链接
思路:
image.png
image.png
<>.find_all(name, attrs, recursive, string, **kwargs) ∙ name : 对标签名称的检索字符串 返回一个列表类型,存储查找的结果
image.png
image.png
<>.find_all(name, attrs, recursive, string, **kwargs) ∙ name : 对标签名称的检索字符串 ∙ attrs: 对标签属性值的检索字符串,可标注属性检索
image.png
>>> soup.find_all('p', 'course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all('p','title')
[<p class="title"><b>The demo python introduces several python courses.</b></p>]
>>> soup.find_all(id = 'link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>>
<>.find_all(name, attrs, recursive, string, **kwargs) ∙ name : 对标签名称的检索字符串 ∙ attrs: 对标签属性值的检索字符串,可标注属性检索
>>> soup.find_all(id = re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
<>.find_all(name, attrs, recursive, string, **kwargs) ∙ name : 对标签名称的检索字符串 ∙ attrs: 对标签属性值的检索字符串,可标注属性检索 ∙ recursive: 是否对子孙全部检索,默认True
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
<>.find_all(name, attrs, recursive, string, **kwargs) ∙ name : 对标签名称的检索字符串 ∙ attrs: 对标签属性值的检索字符串,可标注属性检索 ∙ recursive: 是否对子孙全部检索,默认True ∙ string: <>…</>中字符串区域的检索字符串
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
<tag>(..) 等价于 <tag>.find_all(..) soup(..) 等价于 soup.find_all(..)
image.png
image.png