I am trying to scrape a site with Python and BeautifulSoup, but the site takes a long time to load, the request returns too quickly, and the source is not fully retrieved. I would like to know how to wait 5 seconds before retrieving the source with BeautifulSoup.
I think the code looks like this:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
url = 'https://www.edocente.com.br/pnld/2020/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
req = Request(url, headers = headers)
response = urlopen(req)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
soup.findAll('a', class_="btn bold mt-4 px-5")
I cannot retrieve the whole source code because the site loads slowly, so the tags I want are not retrieved either. How can I wait until the entire source has loaded before retrieving it?
I only want the text of the href attribute from tags like these:
<a href="/pnld/2020/obra/companhia-das-ciencias-6-ano-saraiva" class="btn bold mt-4 px-5">Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-7-ano-saraiva" class="btn bold mt-4 px-5">Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-8-ano-saraiva" class="btn bold mt-4 px-5">Ver Obra </a>
I want to retrieve:
/pnld/2020/obra/companhia-das-ciencias-6-ano-saraiva
/pnld/2020/obra/companhia-das-ciencias-7-ano-saraiva
/pnld/2020/obra/companhia-das-ciencias-8-ano-saraiva
How can I do this? Thanks.
Posted on 2022-02-09 16:03:20
I think the site at that URL (https://www.edocente.com.br/pnld/2020/) is a dynamic website. That means you cannot load it with requests or urllib.
To load dynamic websites and then hand them to BeautifulSoup, you need a browser that loads the site in the background. There are many libraries for this.
Here is a snippet that loads a dynamic website:
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def get_dynamic_soup(url: str) -> BeautifulSoup:
    # Launch a headless browser, let it render the page, then parse the result
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        soup = BeautifulSoup(page.content(), "html.parser")
        browser.close()
        return soup
Install the Python package:
pip install playwright
Then install the Chromium browser (run this in your terminal):
playwright install
And you are ready to scrape dynamic websites.
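On the original "wait 5 seconds" requirement: by default page.goto() waits for the page's load event, and page.wait_for_selector() can wait for a specific element to appear. Once the rendered soup is returned, pulling out the href values is plain BeautifulSoup. A minimal sketch, run here against the static snippet from the question rather than the live site (which may change):

```python
from bs4 import BeautifulSoup

# Static snippet from the question, standing in for get_dynamic_soup(url)
html = '''
<a href="/pnld/2020/obra/companhia-das-ciencias-6-ano-saraiva" class="btn bold mt-4 px-5">Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-7-ano-saraiva" class="btn bold mt-4 px-5">Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-8-ano-saraiva" class="btn bold mt-4 px-5">Ver Obra </a>
'''
soup = BeautifulSoup(html, "html.parser")

# Collect the href attribute of every matching <a> tag
links = [a["href"] for a in soup.find_all("a", class_="btn bold mt-4 px-5")]
print(links)
```

Searching class_ with the full space-separated string works here because it matches the attribute value exactly as written in the markup.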
Posted on 2022-02-09 16:37:24
You can try fetching the page asynchronously with aiohttp and asyncio. For example, pass the url to a ClientSession's get(); you then have a ClientResponse object (response below), from which you can read everything you need from the response.
Install the modules with: pip install cchardet aiodns aiohttp[speedups]
import aiohttp
import asyncio
from bs4 import BeautifulSoup

url = 'https://www.edocente.com.br/pnld/2020/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

async def main():
    async with aiohttp.ClientSession() as session:
        # verify_ssl=False skips certificate verification, like the
        # unverified ssl context in the question's code
        async with session.get(url, headers=headers, verify_ssl=False) as response:
            print("Status:", response.status)
            print("Content-type:", response.headers['content-type'])
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            print(soup.findAll('a', class_="btn bold mt-4 px-5"))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
Output:
Status: 200
Content-type: text/html; charset=UTF-8
[<a :href="linkPrefixo + obra.tituloSeo" class="btn bold mt-4 px-5">Ver {{ (current_edition=='2021-objeto-2') ? 'Coleção' : 'Obra' }} </a>, <a class="btn bold mt-4 px-5">AGUARDE</a>, <a class="btn bold mt-4 px-5">AGUARDE</a>, <a class="btn bold mt-4 px-5">AGUARDE</a>]
https://stackoverflow.com/questions/71052676
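Note that the output above still contains the unrendered Vue template (:href="linkPrefixo + ...", {{ ... }}) and AGUARDE placeholders: aiohttp downloads the raw HTML quickly, but it does not execute JavaScript, so a browser-based approach is still needed to get the final links. A small sanity check can flag when a fetched page still needs JavaScript rendering; this is only a sketch, and the marker strings are an assumption based on the template leftovers visible in the output:

```python
def looks_unrendered(html: str) -> bool:
    """Heuristic: client-side templates usually leave these markers behind."""
    markers = ("{{", "v-if", ":href=")  # Vue template leftovers (assumption)
    return any(m in html for m in markers)

# The aiohttp output above still carries template syntax:
fetched = '<a :href="linkPrefixo + obra.tituloSeo">Ver {{ obra }}</a>'
print(looks_unrendered(fetched))   # True

# Fully rendered HTML has plain attributes:
rendered = '<a href="/pnld/2020/obra/companhia-das-ciencias-6-ano-saraiva">Ver Obra</a>'
print(looks_unrendered(rendered))  # False
```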