I am trying to download a PDF file using the requests module. My code is below:
import requests
url = "<url of the pdf>"
r = requests.get(url, stream=True, timeout=(60, 120), headers={'Connection': 'keep-alive','User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136'})
print(r.headers)
print(r.status_code)
try:
    with open('blah.pdf', 'wb') as f:
        for chunk in r:
            # print(chunk)
            f.write(chunk)
except Exception as e:
    print(e)
The output is as follows:
{'Cache-Control': 'private', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/pdf', 'Server': 'Microsoft-IIS/7.5', 'X-AspNet-Version': '4.0.30319', 'X-Powered-By': 'ASP.NET', 'Date': 'Wed, 02 Oct 2019 05:17:11 GMT', 'Set-Cookie': 'bbb=rd102o00000000000000000000ffff978433aao80; path=/; Httponly; Secure'}
200
('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
Here is the full stack trace:
Traceback (most recent call last):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
    yield
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 755, in read_chunked
    chunk = self._handle_chunk(amt)
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 709, in _handle_chunk
    self._fp._safe_read(2)  # Toss the CRLF at the end of the chunk.
  File "/storage/anaconda3/lib/python3.7/http/client.py", line 612, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(0 bytes read, 2 more expected)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/storage/anaconda3/lib/python3.7/site-packages/requests/models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 560, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 781, in read_chunked
    self._original_response.close()
  File "/storage/anaconda3/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/storage/anaconda3/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    for chunk in r:
  File "/storage/anaconda3/lib/python3.7/site-packages/requests/models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read, 2 more expected)', IncompleteRead(0 bytes read, 2 more expected))
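When I open this PDF in a web browser such as Google Chrome, Chrome's built-in PDF plugin loads it properly and it can be read in the browser. However, if I try to download it by clicking the download icon, I get "Failed - Network Error".
Firefox can neither load nor download it. (Both Firefox and Chrome are updated to the latest version.) When I tested on a Windows machine, Microsoft Edge did download the PDF.
The code above works perfectly when I test it with other PDFs, such as this one: https://adobe.com/content/dam/acom/en/accessibility/products/acrobat/pdfs/acrobat-x-accessibility-checker.pdf
I have tried several command-line tools such as curl, wget, and aria2c (with appropriate headers set, like a browser request), but all of them failed to download the PDF.
wget output: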
connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/pdf]
Saving to: ‘blah.pdf’
<pdf_url> [ <=> ] 101.68K 66.1KB/s in 1.5s
2019-10-02 11:29:50 (69.1 KB/s) - Read error at byte 108786 (Success).
The file downloaded with wget is corrupted.
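A quick way to check whether a saved file looks truncated is to test for the `%PDF-` header at the start and an `%%EOF` marker near the end. The following is only a heuristic sketch: the function name `looks_like_complete_pdf` and the 1024-byte tail window are assumptions made for illustration, and a file that passes can still be damaged in the middle.

```python
import os


def looks_like_complete_pdf(path, tail_size=1024):
    """Heuristic: True if the file starts with %PDF- and has %%EOF near the end."""
    with open(path, "rb") as f:
        head = f.read(5)
        # Jump to the last tail_size bytes (or the start, for small files).
        size = f.seek(0, os.SEEK_END)
        f.seek(max(0, size - tail_size))
        tail = f.read()
    return head == b"%PDF-" and b"%%EOF" in tail
```

Running it on the wget output, for example `looks_like_complete_pdf('blah.pdf')`, gives a rough yes/no before opening the file in a viewer.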
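Another thing I tried was inspecting it with a combination of mitm and chromedriver + selenium.
The automated Chromium browser failed to load the PDF and showed this error: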
502 Bad Gateway
HttpSyntaxException('Malformed chunked body',)
How can I download this PDF using the requests module? Any help would be appreciated.
Answered 2019-10-08 20:49:21
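After a few days I solved the problem. The server was closing the connection improperly, which is why the Python libraries were throwing IncompleteRead. I used the curl installed on my system, with the --compressed argument and all the necessary headers, to download it: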
from subprocess import call

pdf_url = ""
pdf_filename = ""
call(["curl", pdf_url,
      '-H', 'Connection: keep-alive',
      '-H', 'Cache-Control: max-age=0',
      '-H', 'Upgrade-Insecure-Requests: 1',
      '-H', 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
      '-H', 'Sec-Fetch-Mode: navigate',
      '-H', 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
      '-H', 'Sec-Fetch-Site: cross-site',
      '-H', 'Accept-Encoding: gzip, deflate, br',
      '-H', 'Accept-Language: en-US,en;q=0.9,bn;q=0.8',
      '-H', 'Cookie: bbb=rd102o00000000000000000000ffff978432aao80',
      '--compressed', '--output', pdf_filename])
The code above uses the call function of the subprocess module. curl still printed an error message like this:
curl: (18) transfer closed with outstanding read data remaining
Nevertheless, the downloaded PDF works and can be opened with any PDF viewer.
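If you would rather stay inside requests, one option is to treat ChunkedEncodingError as the end of the stream and keep whatever has already been written, which is roughly what the curl call ends up doing. Below is a minimal sketch of that idea. The helper names `save_stream` and `download_pdf` are hypothetical, the timeout values are taken from the question, and it has not been tested against the server in question. Data in the chunk being read when the connection broke may be lost, so the result is not guaranteed to open.

```python
import requests


def save_stream(chunks, path):
    """Write chunks to path; return (bytes_written, ended_cleanly)."""
    written = 0
    ended_cleanly = True
    with open(path, "wb") as f:
        try:
            for chunk in chunks:
                f.write(chunk)
                written += len(chunk)
        except requests.exceptions.ChunkedEncodingError:
            # The server broke the chunked stream; keep what we already have.
            ended_cleanly = False
    return written, ended_cleanly


def download_pdf(url, path):
    """Stream url to path, tolerating a connection that is closed early."""
    with requests.get(url, stream=True, timeout=(60, 120)) as r:
        r.raise_for_status()
        return save_stream(r.iter_content(chunk_size=8192), path)
```

When `download_pdf` returns `ended_cleanly` as False, the file was cut short by the server and should be opened and checked before you rely on it.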
Answered 2019-10-02 08:23:36
I had the same problem as you, and I don't know why it happens. I solved it with urllib:
import urllib.request
urllib.request.urlretrieve(url, 'foo_file.txt', data=your_queries)
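urlretrieve fetches the data from the link and copies it to the file name and path you give as the second argument. You can also change the file extension to .pdf, .json, and so on. Note that passing data turns the request into a POST; omit it for a plain GET.
You can find more information here: https://docs.python.org/3.7/library/urllib.request.html#module-urllib.request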
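If the server cuts the stream short here as well, the standard library raises http.client.IncompleteRead, and that exception carries the bytes received so far in its `partial` attribute. Here is a minimal standard-library sketch that keeps those bytes instead of discarding them. The helper names `copy_response` and `fetch` and the 120-second timeout are assumptions made for illustration, not something from the answer above.

```python
import http.client
import urllib.request


def copy_response(resp, out, chunk_size=8192):
    """Copy resp into out; return True if the body ended cleanly."""
    try:
        while True:
            block = resp.read(chunk_size)
            if not block:
                return True
            out.write(block)
    except http.client.IncompleteRead as exc:
        # Keep the bytes that arrived before the connection was cut.
        out.write(exc.partial)
        return False


def fetch(url, path):
    """Download url to path, tolerating a truncated response body."""
    with urllib.request.urlopen(url, timeout=120) as resp, open(path, "wb") as out:
        return copy_response(resp, out)
```

A return value of False means the body was truncated, so the saved file may or may not be usable.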
https://stackoverflow.com/questions/58195791