文章/答案/技术大牛

发布

社区首页 >问答首页 >使用请求/selenium/cloudscraper返回空值的Web抓取

问使用请求/selenium/cloudscraper返回空值的Web抓取
EN

Stack Overflow用户

提问于 2022-03-30 16:25:21

回答 1查看 1.3K关注 0票数 1

我想我是想从一个受云彩保护的网站收集信息。我尝试了三种选择，它们都返回空值。所以，我不知道网站是否有任何障碍，或者我是否做错了什么。

--更新

然而，当我尝试在Colab中使用它时，F.Hoque提出的解决方案是有效的，我只能得到一个空值。

使用请求

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.portaldoholanda.com.br/assaltante-surra/com-pedacos-de-madeira-populares-dao-surra-em-homem-assalt'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

soup.find('h1',class_="noticia titulo").text # I tried with select too (soup.select('[class="noticia titulo"]'))

使用cloudscraper

import cloudscraper

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
soup = BeautifulSoup(scraper.get(url, headers=headers).content, "html.parser")

soup.find('h1',class_="noticia titulo").text

使用selenium的

import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import InvalidSessionIdException
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument('--ignore-certificate-errors-spki-list')
options.add_argument('--ignore-ssl-errors')
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

river = webdriver.Chrome(options=options, executable_path='/usr/bin/chromedriver')
print("Current session is {}".format(driver.session_id))

driver.get(url)
html = BeautifulSoup(driver.page_source)
innerContent = html.find('h1',class_="noticia titulo").text

beautifulsoup

python-requests

python

selenium

web-scraping

回答 1

Stack Overflow用户

回答已采纳

发布于 2022-03-30 16:45:13

是的，该网站正在使用云彩保护。

https://www.portaldoholanda.com.br/assaltante-surra/com-pedacos-de-madeira-populares-dao-surra-em-homem-assalt is using Cloudflare CDN/Proxy!

  

https://www.portaldoholanda.com.br/assaltante-surra/com-pedacos-de-madeira-populares-dao-surra-em-homem-assalt is using Cloudflare SSL!

下面是使用cloudScraper而不是requests的工作解决方案。

脚本：

import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper(delay=10,   browser={'custom': 'ScraperBot/1.0',})
url = "https://www.portaldoholanda.com.br/assaltante-surra/com-pedacos-de-madeira-populares-dao-surra-em-homem-assalt"
req= scraper.get(url)
#print(req)

soup = BeautifulSoup(req.content, "html.parser")
txt=soup.find('h1',class_="noticia titulo").text
print(txt)

输出：

Com pedaços de madeira, populares dão surra em homem em Manaus; veja vídeo

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/71680822

复制

相似问题

问使用请求/selenium/cloudscraper返回空值的Web抓取
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问使用请求/selenium/cloudscraper返回空值的Web抓取EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问使用请求/selenium/cloudscraper返回空值的Web抓取
EN