I have a use case in which I need to download a large remote file in parts, using multiple threads. Each thread must run simultaneously (in parallel), grabbing a specific part of the file. The expectation is that once all the parts have been downloaded successfully, they are merged back into a single (original) file.
Perhaps the requests library could do the job, but I don't know how to turn it into a multithreaded solution that combines the chunks back together.
import requests

url = 'https://url.com/file.iso'
headers = {"Range": "bytes=0-1000000"}  # first megabyte
r = requests.get(url, headers=headers)
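For illustration, here is a minimal sketch (not from the original question; the URL is the placeholder above and the chunk size is an assumption) of how such ranged requests might be fanned out over a thread pool and stitched back together:

import concurrent.futures
import requests

URL = 'https://url.com/file.iso'  # placeholder URL from the question
CHUNK = 1_000_000                 # 1 MB per request (an assumption)

def fetch_range(bounds):
    # Each worker fetches one inclusive byte range.
    start, end = bounds
    headers = {'Range': f'bytes={start}-{end}'}
    return requests.get(URL, headers=headers).content

size = int(requests.head(URL).headers['Content-Length'])
ranges = [(s, min(s + CHUNK, size) - 1) for s in range(0, size, CHUNK)]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    parts = pool.map(fetch_range, ranges)  # map() preserves input order

with open('file.iso', 'wb') as f:
    for part in parts:
        f.write(part)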
I was also considering using curl, with Python orchestrating the downloads, but I'm not sure that's the right way to go. It seems too complicated and strays from a plain Python solution. Something like this:
curl --range 200000000-399999999 -o file.iso.part2
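If Python really were to orchestrate curl, a rough sketch (file name, part size, and part count are assumptions here) might use subprocess together with a thread pool:

import concurrent.futures
import subprocess

URL = 'https://url.com/file.iso'   # placeholder
PART_SIZE = 200_000_000            # assumed 200 MB parts
N_PARTS = 4                        # assumed; derive from Content-Length in practice

def fetch_part(i):
    start = i * PART_SIZE
    end = start + PART_SIZE - 1
    # Let curl handle the ranged download of one part.
    subprocess.run(
        ['curl', '--range', f'{start}-{end}', '-o', f'file.iso.part{i}', URL],
        check=True,
    )

with concurrent.futures.ThreadPoolExecutor(max_workers=N_PARTS) as pool:
    list(pool.map(fetch_part, range(N_PARTS)))

with open('file.iso', 'wb') as out:
    for i in range(N_PARTS):
        with open(f'file.iso.part{i}', 'rb') as part:
            out.write(part.read())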
Can someone explain how you would go about doing this kind of thing, or post a code example in Python 3? I can usually find Python-related answers fairly easily, but the solution to this one seems to be eluding me.
Posted on 2019-10-26 14:49:38
Here's a version using Python 3 and asyncio. It's just an example and it can be improved, but it should give you everything you need.
get_size: sends a HEAD request to get the size of the file
download_range: downloads a single chunk
download: downloads all of the chunks and merges them

import asyncio
import concurrent.futures
import functools
import requests
import os

# WARNING:
# Here I'm pointing to a publicly available sample video.
# If you are planning on running this code, make sure the
# video is still available, as it might change location or get deleted.
# If necessary, replace it with a URL you know is working.
URL = 'https://download.samplelib.com/mp4/sample-30s.mp4'
OUTPUT = 'video.mp4'


async def get_size(url):
    # A HEAD request returns the headers (including Content-Length)
    # without downloading the body.
    response = requests.head(url)
    size = int(response.headers['Content-Length'])
    return size


def download_range(url, start, end, output):
    # Fetch one inclusive byte range and write it to its own part file.
    headers = {'Range': f'bytes={start}-{end}'}
    response = requests.get(url, headers=headers)

    with open(output, 'wb') as f:
        for part in response.iter_content(1024):
            f.write(part)


async def download(run, loop, url, output, chunk_size=1000000):
    file_size = await get_size(url)
    chunks = range(0, file_size, chunk_size)

    # Offload each blocking download to the thread pool executor.
    tasks = [
        run(
            download_range,
            url,
            start,
            start + chunk_size - 1,
            f'{output}.part{i}',
        )
        for i, start in enumerate(chunks)
    ]
    await asyncio.wait(tasks)

    # Concatenate the part files in order, deleting them as we go.
    with open(output, 'wb') as o:
        for i in range(len(chunks)):
            chunk_path = f'{output}.part{i}'

            with open(chunk_path, 'rb') as s:
                o.write(s.read())

            os.remove(chunk_path)


if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    loop = asyncio.new_event_loop()
    run = functools.partial(loop.run_in_executor, executor)
    asyncio.set_event_loop(loop)

    try:
        loop.run_until_complete(
            download(run, loop, URL, OUTPUT)
        )
    finally:
        loop.close()
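As a quick follow-up (my addition, not part of the original answer), you could sanity-check the merged file by comparing its size against the server-reported Content-Length:

import os
import requests

def verify_size(url, output):
    # Compare the merged file's size with the server-reported Content-Length.
    expected = int(requests.head(url).headers['Content-Length'])
    actual = os.path.getsize(output)
    assert actual == expected, f'expected {expected} bytes, got {actual}'

verify_size(URL, OUTPUT)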
Posted on 2019-10-26 14:48:10
You can use grequests to download in parallel.
import grequests

URL = 'https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-10.1.0-amd64-netinst.iso'
CHUNK_SIZE = 104857600  # 100 MB

HEADERS = []
_start, _stop = 0, 0
for x in range(4):  # file size is > 300 MB, so we download in 4 parts
    _start = _stop
    _stop = CHUNK_SIZE * (x + 1)
    # Byte ranges are inclusive, so end each range at _stop - 1 to avoid
    # fetching the boundary byte twice.
    HEADERS.append({"Range": "bytes=%s-%s" % (_start, _stop - 1)})

rs = (grequests.get(URL, headers=h) for h in HEADERS)
downloads = grequests.map(rs)

# 'wb' rather than 'ab', so a re-run starts from a clean file.
with open('/tmp/debian-10.1.0-amd64-netinst.iso', 'wb') as f:
    for download in downloads:
        print(download.status_code)
        f.write(download.content)
PS: I haven't verified that the ranges are determined correctly, or that the md5sum of the download matches! Broadly speaking, this just shows how it works.
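To address the PS, here is a small helper (my addition, not part of the original answer) for computing the md5sum of the merged file, so it can be compared against the checksum published for the ISO:

import hashlib

def md5sum(path, block_size=1 << 20):
    # Stream the file in 1 MB blocks so large ISOs aren't loaded into memory.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            digest.update(block)
    return digest.hexdigest()

print(md5sum('/tmp/debian-10.1.0-amd64-netinst.iso'))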
Posted on 2022-06-15 14:27:31
The best way I've found to do this is with a module called pySmartDL.
Step 1: pip install pySmartDL
Step 2: to download the file, you can use
from pySmartDL import SmartDL
obj = SmartDL(url, destination)
obj.start()
Note: by default, this gives you a download progress meter.
If you need to hook the download progress to a GUI instead, you can use
import time

obj = SmartDL(url, dest, progress_bar=False)
obj.start(blocking=False)

while not obj.isFinished():
    download_percentage = round(obj.get_progress() * 100, 2)
    time.sleep(0.2)
    print(download_percentage)
If you want to use more threads, you can use
obj = SmartDL(url, destination, threads=7)  # by default, threads=5
obj.start()
You can find more features on the project page:
Downloads: http://pypi.python.org/pypi/pySmartDL/
Documentation: http://itaybb.github.io/pySmartDL/
Project page: https://github.com/iTaybb/pySmartDL/
Bugs and issues: https://github.com/iTaybb/pySmartDL/issues
https://stackoverflow.com/questions/58571343