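I'm scraping the Wall Street Journal with BeautifulSoup, but it never finds the element with id="top-news", even though that element is always present on the home page. I've tried find(), find_all(), and various other methods; they all return None, so any method I call on the result fails because it is called on a NoneType object.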
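I'm trying to extract metadata about the top-news articles, mainly each article's title and URL. Every article's metadata sits under a class named "WSJTheme--headline--7VCzo7Ay", but I only want the ones inside the "top-news" element.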
Here is my code:
import requests
from bs4 import BeautifulSoup
from shutil import copyfile
URL = 'https://www.wsj.com'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='top-news')
topArticles = results.find_all('div', class_='WSJTheme--headline--7VCzo7Ay ')
Answered 2021-05-27 07:34:13
Specify a User-Agent header to get the correct response from the server:
import requests
from bs4 import BeautifulSoup
url = "https://www.wsj.com/"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
for headline in soup.select('#top-news span[class*="headline"]'):
    print(headline.text)
Prints:
Oil Giants Dealt Defeats as Climate Pressures Intensify
At Least Eight Killed in San Jose Shooting
HSBC to Exit Most U.S. Retail Banking
Amazon-MGM Deal Marks Win for Hedge Funds
Cities Reverse Defunding the Police Amid Rising Crime
Federal Prosecutors Have Asked Banks for Information About Archegos Meltdown
Why a Grand Plan to Vaccinate the World Against Covid Unraveled
Inside the Israel-Hamas Conflict and One of Its Deadliest Hours in Gaza
Eric Carle, ‘The Very Hungry Caterpillar’ Author, Dies at 91
Wynn May Face U.S. Action for Role in China’s Push to Expel Businessman
Walmart to Sell New Line of Gap-Branded Homegoods
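Since the question also asks for each article's URL, and the original error came from calling a method on None, here is a minimal sketch that returns (title, url) pairs and guards against a missing section. The function name extract_top_news and the sample markup are hypothetical, and it assumes each headline span is wrapped in an <a> tag, which may not match the live page:

```python
from bs4 import BeautifulSoup


def extract_top_news(html):
    """Return a list of (title, url) tuples found inside the element with id="top-news"."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find(id="top-news")
    if section is None:
        # The section is missing, e.g. because the server returned a
        # different page than a browser would get. Return nothing
        # instead of raising AttributeError on None.
        return []
    articles = []
    for span in section.select('span[class*="headline"]'):
        # Assumption: the headline span sits inside the article's <a> tag.
        link = span.find_parent("a")
        url = link.get("href") if link is not None else None
        articles.append((span.get_text(strip=True), url))
    return articles


# Hypothetical markup for illustration only; the real page may differ.
sample_html = """
<div id="top-news">
  <a href="https://www.wsj.com/articles/example-1"><span class="WSJTheme--headline--abc">First headline</span></a>
  <a href="https://www.wsj.com/articles/example-2"><span class="WSJTheme--headline--abc">Second headline</span></a>
</div>
<div id="other">
  <a href="/x"><span class="WSJTheme--headline--abc">Not top news</span></a>
</div>
"""

for title, url in extract_top_news(sample_html):
    print(title, url)
```

To run it against the live site, pass in requests.get(url, headers=headers).content from the snippet above. An empty list then suggests the response did not contain the expected page.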
https://stackoverflow.com/questions/67716937