Question: Unable to scrape a dynamic web page with Selenium in Python

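Stack Overflow user

Asked 2020-11-05 16:18:30

Answers: 1 · Views: 88 · Followers: 0 · Votes: 0

I am trying to scrape all 5,000 companies from this web page. It is a dynamic page, and more companies are loaded as I scroll down. However, I can only scrape 5 companies, so how can I scrape all 5,000? The URL changes as I scroll down the page. I tried Selenium, but it did not work. https://www.inc.com/profile/onetrust Note: I want to scrape all of the company information, but for now I have selected only two fields.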

import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

options = Options()
driver = webdriver.Chrome(chrome_options=options)
driver.get(my_url)
time.sleep(3)
page = driver.page_source
driver.quit()

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)

Update: I changed the code, but the page does not scroll at all. I also fixed some errors in the BeautifulSoup code.

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'

driver = webdriver.Chrome()
driver.get(my_url)


def scroll_down(self):
    """A method for scrolling the page."""

    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")

    while True:

        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load the page.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:

            break

        last_height = new_height


page_soup = soup(driver.page_source, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)
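
One likely reason nothing scrolls in the code above: `scroll_down` is defined but never called, and it refers to `self.driver` although it is not a method of any class. A minimal sketch of a standalone version follows. The name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are illustrative assumptions, not part of the original post, and whether scrolling alone loads all 5,000 entries on this site is not established here.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the document height stops growing.

    `driver` is any object with an `execute_script` method, such as a
    Selenium WebDriver. Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give lazily loaded content time to arrive.
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            # Height did not change, so assume nothing more was loaded.
            break
        last_height = new_height
    return rounds
```

With the variables from the code above, this would be called as `scroll_to_bottom(driver)` after `driver.get(my_url)` and before reading `driver.page_source`. The `max_rounds` cap is there so the loop cannot run forever on a page that keeps growing.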

Thanks for reading!

python

selenium

web-scraping

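1 Answer

Stack Overflow user

Accepted answer

Answered 2020-11-06 11:47:23

Try the approach below using Python requests: it is simple, straightforward, reliable, fast, and needs less code. I obtained the API URL from the website itself by inspecting the Network tab in Google Chrome's developer tools.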

What the script below does:

First, it takes the API URL and sends a GET request.
After receiving the data, the script parses the JSON using json.loads.
Finally, it iterates over the list of companies and prints their details, e.g. rank, company name, social account links, CEO name, and so on.

import json
import requests
from urllib3.exceptions import InsecureRequestWarning

# Suppress the warning that requests emits when certificate verification is off.
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


def scrap_inc_5000():
    URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'
    response = requests.get(URL, verify=False)
    result = json.loads(response.text)  # Parse the result using json.loads
    # The company list sits under 'fullList'. Any further nested key is not
    # legible in the source, so inspect the JSON if this is not a list of records.
    extracted_data = result['fullList']
    for data in extracted_data:
        print('-' * 100)
        print('Rank: ', data['rank'])
        print('Company: ', data['company'])
        print('CEO Name: ', data['ifc_ceo_name'])
        print('Facebook Address: ', data['ifc_facebook_address'])
        print('File Location: ', data['ifc_filelocation'])
        print('Linkedin Address: ', data['ifc_linkedin_address'])
        print('Twitter Handle: ', data['ifc_twitter_handle'])
        # The source also prints an icon and a secondary link, but the key
        # names for those two fields are not legible.
        print('-' * 100)


scrap_inc_5000()
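
The script in this answer prints each field to the console. If the goal is to keep the data, the same records can be reduced to the wanted fields and written out as CSV. The sketch below is an illustration only: `select_fields`, `to_csv`, and the sample records are hypothetical, and the key names `rank` and `company` are assumed from the translated text (`ifc_ceo_name` appears in the answer as-is).

```python
import csv
import io

# 'rank' and 'company' are assumed key names; 'ifc_ceo_name' is from the answer.
FIELDS = ("rank", "company", "ifc_ceo_name")


def select_fields(records, fields=FIELDS):
    """Keep only the wanted keys; missing keys become None instead of raising."""
    return [{name: record.get(name) for name in fields} for record in records]


def to_csv(rows, fields=FIELDS):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up records standing in for the parsed API response.
sample = [
    {"rank": 1, "company": "Example Co", "ifc_ceo_name": "A. Person", "extra": "ignored"},
    {"rank": 2, "company": "Sample LLC"},
]
rows = select_fields(sample)
print(to_csv(rows))
```

Using `record.get(name)` rather than `record[name]` means a company with a missing field produces an empty CSV cell instead of stopping the whole run with a `KeyError`.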

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/64701137
