I am trying to optimize a simple web scraper that I made. It gets a list of URLs from a table on the main page, then goes to each of those "sub" URLs and gets information from those pages. I was able to write it successfully both synchronously and using concurrent.futures.ThreadPoolExecutor(). However, I am now trying to optimize it to use asyncio and httpx, since these seem to be very fast for making hundreds of HTTP requests.
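For context, a minimal sketch of what such a ThreadPoolExecutor version could look like (this assumes the synchronous client is also httpx; random_user_agent, get_events, and get_details are the asker's own helpers, not shown here):

import httpx
from concurrent.futures import ThreadPoolExecutor

def get_response(client, url):
    # One blocking request per worker thread; httpx.Client is thread-safe.
    resp = client.get(url, headers=random_user_agent())
    return resp.text

def main():
    urls = get_events('https://main-url-to-parse.com')
    with httpx.Client() as client:
        with ThreadPoolExecutor(max_workers=10) as executor:
            detail_responses = list(executor.map(lambda u: get_response(client, u), urls))
    for resp in detail_responses:
        event = get_details(resp)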
I wrote the following script using asyncio and httpx; however, I keep getting the following errors:
httpcore.RemoteProtocolError: Server disconnected without sending a response.
RuntimeError: The connection pool was closed while 4 HTTP requests/responses were still in-flight.
I seem to keep losing the connection while running the script. I even tried running a synchronous version of it and got the same errors. I thought the remote server might be blocking my requests; however, I can run my original program and access each of the URLs from the same IP address without any issues.

What is causing this exception, and how do I fix it?
import httpx
import asyncio

async def get_response(client, url):
    resp = await client.get(url, headers=random_user_agent())  # Gets a random user agent.
    html = resp.text
    return html

async def main():
    async with httpx.AsyncClient() as client:
        tasks = []

        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')

        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))

        detail_responses = await asyncio.gather(*tasks)

        for resp in detail_responses:
            event = get_details(resp)  # Parse url and get desired info

asyncio.run(main())
Posted on 2022-03-10 11:55:48
I ran into the same problem. When an exception occurs in one of the asyncio.gather tasks and gets raised, it causes the httpx client to call __aexit__ and cancel all of the in-flight requests. You can work around it by passing return_exceptions=True to asyncio.gather:
async def main():
    async with httpx.AsyncClient() as client:
        tasks = []

        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')

        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))

        detail_responses = await asyncio.gather(*tasks, return_exceptions=True)

        for resp in detail_responses:
            # here you would need to do something with the exceptions
            # if isinstance(resp, Exception): ...
            event = get_details(resp)  # Parse url and get desired info
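With return_exceptions=True, failed tasks show up in detail_responses as exception objects instead of raising, so the processing loop has to filter them out before calling get_details. A minimal sketch of that filtering (logging and skipping failures is one possible choice, not part of the original answer; asyncio.gather preserves input order, so zipping with urls is safe):

        for url, resp in zip(urls, detail_responses):
            if isinstance(resp, Exception):
                # The request for this url failed; log it (or retry) instead of parsing.
                print(f'{url} failed: {resp!r}')
                continue
            event = get_details(resp)  # Parse the HTML and get the desired info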
https://stackoverflow.com/questions/71138509