
Unable to get the article body text when web scraping

Stack Overflow user
Asked on 2022-11-16 09:12:41
Answers: 1 · Views: 52 · Followers: 0 · Votes: 0

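I am scraping news articles from https://www.scmp.com/. I can get the title and the author's name from each article, but I cannot get the article body (the main content). I tried two approaches, and neither works.

First approach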

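```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

options = webdriver.ChromeOptions()

lists = ['disable-popup-blocking']

caps = DesiredCapabilities().CHROME
caps["pageLoadStrategy"] = "normal"

# The post does not show how `driver` was created; a plain Chrome driver is assumed here.
driver = webdriver.Chrome(options=options)

driver.get('https://www.scmp.com/news/asia/east-asia/article/3199400/japan-asean-hold-summit-tokyo-around-december-2023-japanese-official')
driver.implicitly_wait(5)

bsObj = BeautifulSoup(driver.page_source, 'html.parser')
text_res = bsObj.select('div[class="details__body body"]')

# Concatenate the text of every matching, non-empty element
text = ""
for item in text_res:
    if item.get_text() == "":
        continue
    text = text + item.get_text().strip() + "\n"
```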

Second approach

```python
from selenium import webdriver

options = webdriver.ChromeOptions()

driver = webdriver.Chrome(executable_path= r"E:\chromedriver\chromedriver.exe", options=options) #add your chrome path

driver.get('https://www.scmp.com/news/asia/east-asia/article/3199400/japan-asean-hold-summit-tokyo-around-december-2023-japanese-official')
driver.implicitly_wait(5)

a = driver.find_element_by_class_name("details__body body").text
print(a)
```

Please help me with this. Thank you.

1 Answer

Stack Overflow user

Accepted answer

Posted on 2022-11-16 16:52:30

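There are several reasons why the text of a South China Morning Post article cannot be retrieved.

First, when you open Chrome with selenium, the article URL displays a GDPR notice.

The GDPR notice must be accepted by clicking a button.

Second, the page also displays a popup for setting your news preferences.

The news preferences popup must be closed by clicking its X.

Third, extracting the text with selenium requires some data cleaning. I recommend using BeautifulSoup to extract the clean article text from a script tag on the page.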
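To illustrate the third point in isolation, here is a minimal sketch of pulling `articleBody` out of a page's JSON-LD `script` tags. It is not the answer's code: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, and `JsonLdCollector`, `extract_article_body`, and `SAMPLE_HTML` are hypothetical names and data made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def extract_article_body(html):
    """Return the first articleBody found in the page's JSON-LD, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw, strict=False)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and "articleBody" in item:
                return item["articleBody"]
    return None


# Hypothetical page: the visible markup is noisy, the JSON-LD text is clean.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "WebSite", "name": "Example"}</script>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Demo", "articleBody": "First paragraph. Second paragraph."}
</script>
</head><body><div class="details__body body"><p>First <a>paragraph</a>.</p></div></body></html>
"""

print(extract_article_body(SAMPLE_HTML))
```

With a live page, the same function could be fed `driver.page_source` once the overlays have been dismissed. Whether a given site exposes `articleBody` in its JSON-LD has to be checked per site.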

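Below is some rough code that clicks the GDPR button, closes the news preferences popup, and extracts the article text.

This code can be refined to suit your needs.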

```python
import json

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = Options()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-popup-blocking")
chrome_options.add_argument("--ignore-certificate-errors")

# disable the banner "Chrome is being controlled by automated test software"
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])

# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'), options=chrome_options)

url_main = 'https://www.scmp.com/news/asia/east-asia/article/3199400/japan-asean-hold-summit-tokyo-around-december-2023-japanese-official'

try:
    driver.get(url_main)
    wait = WebDriverWait(driver, 120)

    # wait until the page has rendered its bottom messaging area
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "has-bottom-messaging")))

    # accept the GDPR notice (the wait returns the located element)
    gdpr_button = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, "gdpr-banner__accept")))
    ActionChains(driver).move_to_element(gdpr_button).click(gdpr_button).perform()

    # close the news preferences popup via its X icon
    my_news_popup = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, "my-news-landing-popup__icon-close")))
    ActionChains(driver).move_to_element(my_news_popup).click(my_news_popup).perform()

    # the clean article text is stored in a JSON-LD script tag under "articleBody"
    raw_soup = BeautifulSoup(driver.page_source, 'lxml')
    for script_tag in raw_soup.find_all(name='script', attrs={'type': 'application/ld+json'}):
        dictionary = json.loads("".join(script_tag.contents), strict=False)
        if isinstance(dictionary, dict) and 'articleBody' in dictionary:
            print(dictionary['articleBody'])
finally:
    # always shut the browser down, even if a wait times out
    driver.quit()
```

Output

```text
The leaders of Japan and 10-member Asean on Saturday agreed to hold a summit in Tokyo
in or around December next year to commemorate the 50th anniversary of their relationship,
a Japanese official said. Japanese Prime Minister Fumio Kishida and his counterparts from
the Association of Southeast Asian Nations also pledged to deepen their cooperative ties
when they met in Phnom Penh, according to the official. Japan has been trying to boost
relations with Asean at a time when some of its members are increasingly vigilant against
China ’s assertive territorial claims in the East and South China seas . Why is Japan
losing ground in Asean despite being a bigger investor than China? “Although concerns are
growing over opaque and unfair development support, Japan will continue to back sustainable
growth” of Southeast Asia , Kishida said at the outset of the meeting, which was open to
the media, in a veiled reference to Beijing’s trade and economic practices. Leaders of
several nations mentioned the importance of freedom of navigation and overflight in the
South China Sea, and of the necessity of adhering to international law, the official said
after the meeting. The agreement on the special summit in Tokyo came as the US and China
have been intensifying their competition for influence in Southeast Asia. In November last
year, China and Asean agreed to upgrade their ties to a “comprehensive strategic
partnership” when the two sides held a special online summit commemorating the 30th
anniversary of their dialogue, with Chinese President Xi Jinping making a rare appearance.
China has stepped up efforts to expand its clout in the region as security tensions
with the US escalate in nearby waters. After China’s move, the US in May declared with
Asean that they had decided to elevate their relationship to a “comprehensive strategic
partnership” as well. At the Asean-Japan gathering, Kishida also reiterated his support
for the “Asean Outlook on the Indo-Pacific”, an initiative aimed at maintaining peace,
freedom and prosperity in the region, the official said.
```
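A side note on the locators used in the question, offered as a sketch rather than a verified fix. A class-name locator expects a single class name, so `"details__body body"` (two classes separated by a space) does not mean "an element with both classes". Likewise, `div[class="details__body body"]` only matches when the `class` attribute is exactly that string. A CSS class selector expresses "has all of these classes" directly. The helper name `classes_to_css` below is hypothetical.

```python
def classes_to_css(tag, class_attr):
    """Build a CSS selector matching `tag` elements that carry every class in `class_attr`."""
    classes = class_attr.split()
    return tag + "".join("." + name for name in classes)


selector = classes_to_css("div", "details__body body")
print(selector)  # div.details__body.body
```

The resulting string can be passed to BeautifulSoup's `select(selector)` or to Selenium's `driver.find_element(By.CSS_SELECTOR, selector)`. Note that the `find_element_by_*` helpers were removed in later Selenium 4 releases. Whether the site still uses these class names is an assumption, and the overlays described in the answer still have to be dismissed first.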
Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/74457838
