In Python there are two common ways to send HTTP requests: the built-in urllib library and the third-party requests library.
urllib: Python's built-in HTTP request library, usable without any extra installation. Python 2 had two libraries, urllib and urllib2, for sending requests; Python 3 unified them into urllib. Official documentation: https://docs.python.org/3/library/urllib.html
urllib contains several modules; the commonly used ones are:
urllib.request: opens and reads URLs; the core module for sending requests
urllib.error: defines the exceptions raised by urllib.request
urllib.parse: parses and assembles URLs
urllib.robotparser: parses robots.txt files
Basic usage of urlopen()
urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
Example:
import urllib.request
response = urllib.request.urlopen('https://angelni.github.io/')
print(response)
Print the type and attributes of the response object:
import urllib.request
response = urllib.request.urlopen('https://angelni.github.io/')
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
Output:
<class 'http.client.HTTPResponse'>
200
[('Connection', 'close'), ('Content-Length', '36930'), ('Content-Type', 'text/html; charset=utf-8'), ('Server', 'GitHub.com'), ('Strict-Transport-Security', 'max-age=31556952'), ('Last-Modified', 'Mon, 11 May 2020 05:38:18 GMT'), ('ETag', '"5eb8e4ca-9042"'), ('Access-Control-Allow-Origin', '*'), ('Expires', 'Thu, 21 May 2020 20:17:51 GMT'), ('Cache-Control', 'max-age=600'), ('X-Proxy-Cache', 'MISS'), ('X-GitHub-Request-Id', '9112:45A4:22C0FB:24F624:5EC6DF95'), ('Accept-Ranges', 'bytes'), ('Date', 'Fri, 22 May 2020 05:42:11 GMT'), ('Via', '1.1 varnish'), ('Age', '108'), ('X-Served-By', 'cache-tyo19939-TYO'), ('X-Cache', 'HIT'), ('X-Cache-Hits', '1'), ('X-Timer', 'S1590126131.497095,VS0,VE1'), ('Vary', 'Accept-Encoding'), ('X-Fastly-Request-ID', 'd0682d145390665c6ad5fa6b629b2af3a18a7654')]
GitHub.com
The timeout parameter
With timeout set to 0.1, if the server has not responded after 0.1 seconds, a URLError exception is raised. For example:
import urllib.request
response = urllib.request.urlopen('https://angelni.github.io/', timeout=0.1)
print(response.read())
Output (the beginning of the traceback is omitted here):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "f:\C-and-Python-Algorithn\python\Spider\1.py", line 3, in <module>
response = urllib.request.urlopen('https://angelni.github.io/', timeout=0.1)
File "E:\python\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "E:\python\lib\urllib\request.py", line 526, in open
response = self._open(req, data)
File "E:\python\lib\urllib\request.py", line 544, in _open
'_open', req)
File "E:\python\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "E:\python\lib\urllib\request.py", line 1361, in https_open
context=self._context, check_hostname=self._check_hostname)
File "E:\python\lib\urllib\request.py", line 1320, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error _ssl.c:825: The handshake operation timed out>
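In practice it helps to tell a timeout apart from other failures. A minimal sketch, assuming a URLError whose reason is a socket.timeout indicates a timeout (the URL and the 0.1-second timeout are illustrative):

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=0.1):
    """Return 'ok', 'timeout', or 'error' depending on what happened."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            return 'ok'
    except urllib.error.URLError as e:
        # A timeout usually surfaces as a URLError wrapping socket.timeout
        if isinstance(e.reason, socket.timeout):
            return 'timeout'
        return 'error'
    except socket.timeout:
        # read() can also raise socket.timeout directly
        return 'timeout'

status = fetch('http://httpbin.org/get')
print(status)
```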
The Request class lets you attach extra information such as data and headers to a request.
The constructor of Request:
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
Explanation of the constructor parameters:
url: the URL to request; the only required argument
data: the request body, which must be bytes; if supplied, the request method defaults to POST
headers: a dict of request headers (they can also be added later with add_header())
origin_req_host: the host name or IP address of the original request
unverifiable: whether the request is unverifiable, i.e. the user had no chance to approve it (e.g. an image embedded in a page)
method: a string indicating the request method, such as GET, POST, or PUT
A simple example:
import urllib.request
import urllib.parse
url = 'http://www.baidu.com/'
# Custom headers to disguise the client
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
# Build the request object
request = urllib.request.Request(url=url, headers=headers)
# Send the request
response = urllib.request.urlopen(request)
print(response.read().decode())
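The data parameter is what turns a request into a POST; it must be bytes, so the serialized form fields are encoded first. A sketch that builds (but does not send) such a request so its stored attributes can be inspected; the URL and form field are illustrative:

```python
import urllib.parse
import urllib.request

# data must be bytes: urlencode the form fields, then encode
data = urllib.parse.urlencode({'name': 'TRHX'}).encode('utf-8')

req = urllib.request.Request(
    url='http://httpbin.org/post',
    data=data,
    headers={'User-Agent': 'Mozilla/5.0'},
)

print(req.get_method())  # passing data makes the default method POST
print(req.data)
# Sending it would be: urllib.request.urlopen(req)
```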
urlretrieve() saves the content at a URL directly to a local file; here it is saved to the current folder. A simple example:
import urllib.request
url = 'https://cdn.jsdelivr.net/gh/AngelNI/CDN@3.0/imgs/avatar.png'
# response = urllib.request.urlopen(image_url)
# with open('angelni.png', 'wb') as fp:
# fp.write(response.read())
urllib.request.urlretrieve(url, 'angelni.png')
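urlretrieve() also accepts a reporthook callback, called with (blocks transferred so far, block size, total size), which can be used to show download progress. A sketch that uses a local file:// URL so it runs without network access; with a real HTTP URL the hook works the same way:

```python
import os
import tempfile
import urllib.request

# Create a small local file to "download" through a file:// URL
src = tempfile.NamedTemporaryFile(delete=False, suffix='.bin')
src.write(b'x' * 4096)
src.close()
url = 'file:' + urllib.request.pathname2url(src.name)

def show_progress(block_num, block_size, total_size):
    # Called once before the transfer and once per block read
    done = min(block_num * block_size, total_size)
    print(f'{done}/{total_size} bytes')

dest = src.name + '.copy'
urllib.request.urlretrieve(url, dest, reporthook=show_progress)
print(os.path.getsize(dest))
```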
Opening a page that does not exist raises a URLError, which has a reason attribute returning the cause of the error. A simple example:
from urllib import request, error
try:
response = request.urlopen('https://angelni.github.io/index/')
except error.URLError as e:
print(e.reason)
Output:
Not Found
HTTPError is a subclass of URLError, dedicated to handling HTTP request errors such as failed authentication. It has the following three attributes:
code: the HTTP status code returned by the server
reason: the cause of the error
headers: the response headers
A simple example:
from urllib import request, error
try:
response = request.urlopen('https://angelni.github.io/index/')
except error.HTTPError as e:
print(e.code, e.reason, e.headers)
Output:
404 Not Found Connection: close
Content-Length: 14054
Content-Type: text/html; charset=utf-8
Server: GitHub.com
Strict-Transport-Security: max-age=31556952
ETag: "5eb8e4ca-36e6"
Access-Control-Allow-Origin: *
X-Proxy-Cache: MISS
X-GitHub-Request-Id: 6496:45A5:369AC6:3A1948:5EC76895
Accept-Ranges: bytes
Date: Fri, 22 May 2020 05:59:06 GMT
Via: 1.1 varnish
Age: 403
X-Served-By: cache-tyo19926-TYO
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1590127147.595289,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: b2a7e9ca856a64157cd67ea4a59e449062baa169
Because URLError is the parent class of HTTPError, you can catch the subclass error first and then fall back to the parent class. The earlier code, improved:
from urllib import request, error
try:
response = request.urlopen('https://angelni.github.io/index/')
except error.HTTPError as e:
print(e.reason, e.code, e.headers)
except error.URLError as e:
print(e.reason)
else:
print('Request Successfully')
urlencode() serializes a dict into GET request parameters. Example:
from urllib.parse import urlencode
data = {
'ie': 'utf-8',
'wd': 'TRHX',
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(data)
print(url)
Output:
http://www.baidu.com?ie=utf-8&wd=TRHX
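urlencode() also percent-encodes non-ASCII values, and with doseq=True a sequence value is expanded into repeated keys. A small sketch (the parameter names are illustrative):

```python
from urllib.parse import urlencode

params = {'wd': '爬虫', 'tag': ['a', 'b']}
qs = urlencode(params, doseq=True)
print(qs)  # wd=%E7%88%AC%E8%99%AB&tag=a&tag=b
```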
parse_qs() does the opposite of urlencode(): it deserializes GET request parameters back into a dict. Example:
from urllib.parse import parse_qs
query = 'name=TRHX&age=20'
print(parse_qs(query))
Output:
{'name': ['TRHX'], 'age': ['20']}
parse_qsl() converts the parameters into a list of tuples. Example:
from urllib.parse import parse_qsl
query = 'name=TRHX&age=20'
print(parse_qsl(query))
Output:
[('name', 'TRHX'), ('age', '20')]
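Because parse_qsl() preserves order and duplicate keys, it round-trips cleanly with urlencode(). A quick check:

```python
from urllib.parse import parse_qsl, urlencode

query = 'name=TRHX&age=20'
pairs = parse_qsl(query)
print(urlencode(pairs) == query)  # True
```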
urlparse() splits a URL into segments and returns 6 components. Example:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
Output:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
The result is a ParseResult object containing 6 parts: scheme, netloc, path, params, query, and fragment, i.e. the protocol, domain, path, parameters, query string, and anchor.
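ParseResult is a named tuple, so each part can be read either by attribute name or by index. For example:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
# Equivalent access by name and by position
print(result.scheme, result[0])  # http http
print(result.netloc, result[1])  # www.baidu.com www.baidu.com
print(result.query, result[4])   # id=5 id=5
```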
urlunparse() does the opposite of urlparse(): it assembles a URL. The argument must be an iterable of length exactly 6, otherwise an error about too few or too many arguments is raised. Example:
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
Output:
http://www.baidu.com/index.html;user?a=6#comment
urlsplit() is similar to urlparse(), except that it does not parse the params part separately and returns only 5 components; params stays merged into path. Example:
from urllib.parse import urlsplit
result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
Output:
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
urlunsplit() is similar to urlunparse(): it assembles a URL from an iterable, which must have length 5. Example:
from urllib.parse import urlunsplit
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))
Output:
http://www.baidu.com/index.html?a=6#comment
urljoin() combines URLs: given a base URL and a new link, it analyzes the scheme, netloc, and path of the base URL, fills in the parts missing from the new link, and returns the result. Example:
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'friends.html'))
print(urljoin('http://www.baidu.com', 'https://www.itrhx.com/friends.html'))
print(urljoin('http://www.baidu.com/friends.html', 'https://www.itrhx.com/friends.html'))
print(urljoin('http://www.baidu.com/friends.html', 'https://www.itrhx.com/friends.html?id=2'))
print(urljoin('http://www.baidu.com?wd=trhx', 'https://www.itrhx.com/index.html'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
Output:
http://www.baidu.com/friends.html
https://www.itrhx.com/friends.html
https://www.itrhx.com/friends.html
https://www.itrhx.com/friends.html?id=2
https://www.itrhx.com/index.html
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
quote() converts content to URL-encoded form. When a URL contains Chinese parameters, it can convert the Chinese characters to URL encoding. Example:
from urllib.parse import quote
keyword = '中国'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
Output:
https://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD
unquote() does the opposite of quote(): it decodes a URL. Example:
from urllib.parse import unquote
url = 'https://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD'
print(unquote(url))
Output:
https://www.baidu.com/s?wd=中国
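By default quote() leaves '/' unencoded (safe='/'); the safe parameter controls which characters are exempt from encoding. A small sketch:

```python
from urllib.parse import quote

path = '/s/中 国'
print(quote(path))           # '/' kept by default: /s/%E4%B8%AD%20%E5%9B%BD
print(quote(path, safe=''))  # '/' encoded too: %2Fs%2F%E4%B8%AD%20%E5%9B%BD
```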
The Robots protocol, also known as the crawler protocol, tells crawlers and search engines which pages may be scraped and which may not. It usually takes the form of a text file named robots.txt placed in the root directory of a website.
The basic format of robots.txt:
User-agent:
Disallow:
Allow:
The declaration of the RobotFileParser class:
urllib.robotparser.RobotFileParser(url='')
Its commonly used methods:
set_url(): sets the URL pointing to the robots.txt file
read(): fetches robots.txt and feeds it to the parser; it must be called, otherwise every can_fetch() check returns False
parse(): parses robots.txt content passed in as a list of lines
can_fetch(useragent, url): returns whether the given user agent is allowed to fetch the URL
mtime(): returns the time robots.txt was last fetched
modified(): sets the last-fetched time to the current time
Using Jianshu (jianshu.com) as an example:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.jianshu.com/p/6d9527300b4c'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))
Output:
False
False
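RobotFileParser can also be fed rules directly through parse(), which takes robots.txt content as a list of lines; that makes it easy to experiment without fetching anything. A sketch with made-up rules (the rules and URLs are illustrative, not Jianshu's real robots.txt):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the lines of a robots.txt file directly
rp.parse([
    'User-agent: *',
    'Disallow: /search',
    'Allow: /p/',
])
print(rp.can_fetch('*', 'https://www.jianshu.com/p/6d9527300b4c'))   # True
print(rp.can_fetch('*', 'https://www.jianshu.com/search?q=python'))  # False
```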