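I am trying to use Python asynchronously in order to speed up my requests to a server. The server has a slow response time (often several seconds, but sometimes faster than a second), but it works well in parallel. I have no access to this server and cannot change anything about it. So I have a big list of URLs (pages in the code below), which I know beforehand, and I want to speed up their loading by making NO_TASKS=5 requests at a time. On the other hand, I don't want to overload the server, so I want a minimum pause of 1 second between requests (i.e. a limit of 1 request per second).
So far I have successfully implemented the semaphore part (five requests at a time) using a Trio queue.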
import asks
import time
import trio

NO_TASKS = 5

asks.init('trio')
asks_session = asks.Session()
queue = trio.Queue(NO_TASKS)
next_request_at = 0
results = []

pages = [
    'https://www.yahoo.com/',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://www.macrumors.com/',
    'http://arstechnica.com/',
    'http://www.reuters.com/',
    'http://abcnews.go.com/',
    'http://www.cnbc.com/',
]


async def async_load_page(url):
    global next_request_at
    sleep = next_request_at
    next_request_at = max(trio.current_time() + 1, next_request_at)
    await trio.sleep_until(sleep)
    next_request_at = max(trio.current_time() + 1, next_request_at)
    print('start loading page {} at {} seconds'.format(url, trio.current_time()))
    req = await asks_session.get(url)
    results.append(req.text)


async def producer(url):
    await queue.put(url)


async def consumer():
    while True:
        if queue.empty():
            print('queue empty')
            return
        url = await queue.get()
        await async_load_page(url)


async def main():
    async with trio.open_nursery() as nursery:
        for page in pages:
            nursery.start_soon(producer, page)
        await trio.sleep(0.2)
        for _ in range(NO_TASKS):
            nursery.start_soon(consumer)


start = time.time()
trio.run(main)
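However, I am missing the implementation of the throttling part, i.e. the limit of at most 1 request per second. You can see my attempt above (the first five lines of async_load_page), but as you can see when you run the code, it does not work: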
start loading page http://www.reuters.com/ at 58097.12261669573 seconds
start loading page http://www.python.org at 58098.12367392373 seconds
start loading page http://www.pypy.org at 58098.12380622773 seconds
start loading page http://www.macrumors.com/ at 58098.12389389973 seconds
start loading page http://www.cisco.com at 58098.12397854373 seconds
start loading page http://arstechnica.com/ at 58098.12405119873 seconds
start loading page http://www.facebook.com at 58099.12458010273 seconds
start loading page http://www.twitter.com at 58099.37738939873 seconds
start loading page http://www.perl.org at 58100.37830828273 seconds
start loading page http://www.cnbc.com/ at 58100.91712723473 seconds
start loading page http://abcnews.go.com/ at 58101.91770178373 seconds
start loading page http://www.jython.org at 58102.91875295573 seconds
start loading page https://www.yahoo.com/ at 58103.91993155273 seconds
start loading page http://www.cnn.com at 58104.48031027673 seconds
queue empty
queue empty
queue empty
queue empty
queue empty
I have spent some time searching for an answer but could not find one.
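Answered 2018-10-04 04:19:55
One way to achieve your goal is to use a mutex that a worker acquires before sending a request and that is released in a separate task after some interval: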
from typing import Iterator

import asks
import trio


async def fetch_urls(urls: Iterator, responses, n_workers, throttle):
    # Using binary `trio.Semaphore` to be able
    # to release it from a separate task.
    mutex = trio.Semaphore(1)

    async def tick():
        await trio.sleep(throttle)
        mutex.release()

    async def worker():
        # All workers pull from the same iterator, so each URL is fetched once.
        for url in urls:
            await mutex.acquire()
            nursery.start_soon(tick)
            response = await asks.get(url)
            responses.append(response)

    async with trio.open_nursery() as nursery:
        for _ in range(n_workers):
            nursery.start_soon(worker)
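If a worker gets its response in less than throttle seconds, it will block on await mutex.acquire(). Otherwise, the mutex will already have been released by tick, and another worker will be able to acquire it.
This is similar to how the leaky bucket algorithm works:
Workers waiting for the mutex are like water in a bucket.
Each tick is like the bucket leaking at a constant rate.
If you add a bit of logging right before sending a request, you should get output similar to this: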
0.00169 started
0.001821 n_workers: 5
0.001833 throttle: 1
0.002152 fetching https://httpbin.org/delay/4
1.012 fetching https://httpbin.org/delay/2
2.014 fetching https://httpbin.org/delay/2
3.017 fetching https://httpbin.org/delay/3
4.02 fetching https://httpbin.org/delay/0
5.022 fetching https://httpbin.org/delay/2
6.024 fetching https://httpbin.org/delay/2
7.026 fetching https://httpbin.org/delay/3
8.029 fetching https://httpbin.org/delay/0
9.031 fetching https://httpbin.org/delay/0
10.61 finished
Answered 2018-07-10 04:05:52
Using trio.current_time() for this is way too complicated.
The easiest way to do rate limiting is a rate limiter, i.e. a separate task that basically does this:
async def ratelimit(queue, tick, task_status=trio.TASK_STATUS_IGNORED):
    with trio.open_cancel_scope() as scope:
        task_status.started(scope)
        while True:
            await queue.get()
            await trio.sleep(tick)
Example use:
async with trio.open_nursery() as nursery:
    q = trio.Queue(0)
    limiter = await nursery.start(ratelimit, q, 1)
    while whatever:
        await q.put(None)  # will return at most once per second
        do_whatever()
    limiter.cancel()
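In other words, you start that task with
    q = trio.Queue(0)
    limiter = await nursery.start(ratelimit, q, 1)
and then you can be sure that at most one call of
    await q.put(None)
will return per second, because the zero-length queue acts as a rendezvous point. When you are done, call
    limiter.cancel()
to stop the rate limiting task, otherwise your nursery won't exit.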
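If your use case includes starting sub-tasks that need to finish before the limiter gets cancelled, the easiest way to do that is to run them in another nursery, i.e. instead of
    while whatever:
        await q.put(None)  # will return at most once per second
        do_whatever()
    limiter.cancel()
you would use something like
    async with trio.open_nursery() as inner_nursery:
        await start_tasks(inner_nursery, q)
    limiter.cancel()
which waits for the tasks to finish before touching the limiter.
NB: You can easily adapt this for a "burst" mode, i.e. allow a certain number of requests before the rate limiting kicks in, simply by increasing the queue's length.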
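Answered 2019-02-02 04:14:27
Motivation and origin of this solution
Some months have passed since I asked this question. Python has improved since then, and so has trio (and my knowledge of both). So I thought it was time for a little update using Python 3.6 with type annotations and trio-0.10 memory channels.
I developed my own improvement of the original version, but after reading @Roman Novatorov's great solution I adapted it again, and this is the result. Credit goes to him for the main structure of the function (and for the idea of using httpbin.org for illustration). I chose memory channels instead of a mutex so that all of the token re-release logic can be taken out of the worker.
Explanation of the solution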
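I can rephrase the original problem as one of coordination between a token issuer and the workers that consume its tokens: a worker may only start a request after it has received a token, and tokens are issued at the throttled rate.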
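In case you are not familiar with memory channels and their syntax, you can read about them in the trio documentation. I think the logic of async with memory_channel and memory_channel.clone() can be confusing at first.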
from typing import List, Iterator

import asks
import trio

asks.init('trio')

links: List[str] = [
    'https://httpbin.org/delay/7',
    'https://httpbin.org/delay/6',
    'https://httpbin.org/delay/4'
] * 3


async def fetch_urls(urls: List[str], number_workers: int, throttle_rate: float):

    async def token_issuer(token_sender: trio.abc.SendChannel, number_tokens: int):
        async with token_sender:
            for _ in range(number_tokens):
                await token_sender.send(None)
                await trio.sleep(1 / throttle_rate)

    async def worker(url_iterator: Iterator, token_receiver: trio.abc.ReceiveChannel):
        async with token_receiver:
            for url in url_iterator:
                await token_receiver.receive()

                print(f'[{round(trio.current_time(), 2)}] Start loading link: {url}')
                response = await asks.get(url)
                # print(f'[{round(trio.current_time(), 2)}] Loaded link: {url}')
                responses.append(response)

    responses = []
    url_iterator = iter(urls)
    token_send_channel, token_receive_channel = trio.open_memory_channel(0)

    async with trio.open_nursery() as nursery:
        async with token_receive_channel:
            nursery.start_soon(token_issuer, token_send_channel.clone(), len(urls))
            for _ in range(number_workers):
                nursery.start_soon(worker, url_iterator, token_receive_channel.clone())

    return responses


responses = trio.run(fetch_urls, links, 5, 1.)
Example of logging output:
As you can see, the minimum time between any two page requests is one second:
[177878.99] Start loading link: https://httpbin.org/delay/7
[177879.99] Start loading link: https://httpbin.org/delay/6
[177880.99] Start loading link: https://httpbin.org/delay/4
[177881.99] Start loading link: https://httpbin.org/delay/7
[177882.99] Start loading link: https://httpbin.org/delay/6
[177886.20] Start loading link: https://httpbin.org/delay/4
[177887.20] Start loading link: https://httpbin.org/delay/7
[177888.20] Start loading link: https://httpbin.org/delay/6
[177889.44] Start loading link: https://httpbin.org/delay/4
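Comments on the solution
As is not unusual for asynchronous code, this solution does not preserve the original order of the requested URLs. One way to solve this is to associate an id with each original URL, e.g. in a tuple, put the responses into a response dictionary keyed by that id, and afterwards read the responses out one by one into a response list (this avoids sorting and has linear complexity).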
https://stackoverflow.com/questions/51250706