XML format example:
Like regular expressions, XPath has its own syntax rules.
In the XPath language, an XML/HTML document is treated as a node tree.
XPath expressions can be used to retrieve tag content. For example, to get the class attributes of all div tags: //div/@class
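As a minimal sketch of that expression (using the lxml library introduced below, on a made-up snippet):

```python
from lxml import etree

# Hypothetical HTML fragment for illustration
snippet = '<html><body><div class="nav"></div><div class="para"></div></body></html>'
tree = etree.HTML(snippet)
classes = tree.xpath("//div/@class")
print(classes)  # ['nav', 'para']
```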
Building on the two basics of XPath and the DOM tree, Python libraries can be used for targeted information extraction. Third-party Python libraries for processing XML and HTML:
lxml is a third-party Python library for processing XML and HTML.
From a web-crawling perspective, what we care about is lxml's text-parsing functionality.
In an IPython environment, import lxml with: from lxml import etree
Depending on the type of the target text, lxml provides different parsing functions:
from lxml import etree

data = """
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
...
<script src="/static/js/pageJs/courses-list.js"></script>
</body>
</html>
"""
page = etree.HTML(data.encode("utf-8"))
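To illustrate why the choice of function matters (a sketch with toy input): etree.HTML tolerates broken markup and supplies missing html/body wrappers, whereas etree.fromstring expects well-formed XML.

```python
from lxml import etree

# The HTML parser repairs broken markup
doc = etree.HTML("<p>hello")
print(etree.tostring(doc))  # b'<html><body><p>hello</p></body></html>'

# The XML parser rejects the same input
try:
    etree.fromstring("<p>hello")
except etree.XMLSyntaxError:
    print("not well-formed XML")
```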
Get the text content of the a tags in the page's list items:
//div//li//a/text()

texts = page.xpath("//div//li//a/text()")
for text in texts:
    print(text)
Take Baidu Baike as an example:
import requests
from lxml import etree

s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
page = s.get('https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB').content.decode("utf-8")
html = etree.HTML(page)
hrefs = html.xpath("//a/@href")  # every href attribute on the page
for href in hrefs:
    print(href)
The code above extracts all links from the Baidu Baike page. The resulting links include both absolute and relative links.
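Relative links can be resolved against the page URL with the standard library's urllib.parse.urljoin (the href values below are made up for illustration):

```python
from urllib.parse import urljoin

base = "https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB"
hrefs = ["/item/Python", "https://www.baidu.com/"]  # illustrative values
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
# ['https://baike.baidu.com/item/Python', 'https://www.baidu.com/']
```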
To narrow the results to links inside the article body, restrict the XPath to the div elements with class "para":

import requests
from lxml import etree

s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
page = s.get('https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB').content.decode("utf-8")
html = etree.HTML(page)
hrefs = html.xpath('//div[@class="para"]/a/@href')
for href in hrefs:
    print(href)
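The predicate [@class="para"] keeps only div elements whose class attribute is exactly "para". A self-contained sketch on a made-up fragment:

```python
from lxml import etree

# Toy fragment for illustration
snippet = """
<div class="para"><a href="/item/A">A</a></div>
<div class="other"><a href="/item/B">B</a></div>
"""
tree = etree.HTML(snippet)
matches = tree.xpath('//div[@class="para"]/a/@href')
print(matches)  # ['/item/A']
```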
To extract the article text rather than the links, select the text nodes of the same div elements:

import requests
from lxml import etree

s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
page = s.get('https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB').content.decode("utf-8")
html = etree.HTML(page)
paras = html.xpath('//div[@class="para"]/text()')
for para in paras:
    print(para)
BeautifulSoup is another third-party Python library for parsing XML/HTML:
Methods for extracting content from a web page include regular expressions and BeautifulSoup.
BeautifulSoup supports different parsers:
lxml can serve as BeautifulSoup's underlying parser, and it is the one officially recommended by the BeautifulSoup documentation.
Passing a string or a file handle to the BeautifulSoup constructor is enough to parse HTML:
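A minimal sketch of both ways (the file path is hypothetical; "html.parser" is the parser bundled with the standard library):

```python
from bs4 import BeautifulSoup

# From a string
soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")
print(soup.p.string)  # hi

# From a file handle (hypothetical path, shown for illustration)
# with open("page.html", encoding="utf-8") as f:
#     soup = BeautifulSoup(f, "html.parser")
```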
BeautifulSoup represents every node of the DOM tree as an object. These node objects fall into the following types: Tag, NavigableString, BeautifulSoup, and Comment.
The key to using BeautifulSoup is learning to work with the different node objects. The code below shows the different node types:
Again taking Baidu Baike as the example:
import requests
from bs4 import BeautifulSoup as bs

s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
page = s.get('https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB').content.decode("utf-8")
html = bs(page, "lxml")  # name the parser explicitly to avoid a warning
print(type(html))               # <class 'bs4.BeautifulSoup'>
print(type(html.html))          # <class 'bs4.element.Tag'>
print(type(html.title.string))  # <class 'bs4.element.NavigableString'>
import requests
from bs4 import BeautifulSoup as bs

s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
page = s.get('https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB').content.decode("utf-8")
html = bs(page, "lxml")
paras = html.find_all(class_="para")  # every tag whose class attribute is "para"
for para in paras:
    print(para)
import requests
from bs4 import BeautifulSoup as bs

s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
page = s.get('https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB').content.decode("utf-8")
html = bs(page, "lxml")
paras = html.find_all(class_="para")
for para in paras:
    for a in para("a"):  # para("a") is shorthand for para.find_all("a")
        if a.has_attr('href'):
            print(a["href"])
The basis for locating tags, for example:
import requests
from bs4 import BeautifulSoup as bs

s = requests.session()
s.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}
page = s.get('https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB').content.decode("utf-8")
html = bs(page, "lxml")
paras = html.find_all(class_="para")
for para in paras:
    print(para.text)  # .text returns the tag's text with all markup stripped