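I'm trying to scrape all 5000 companies from this web page. It is a dynamic page, and more companies are loaded as I scroll down. However, I can only scrape 5 companies, so how can I scrape all 5000? The URL changes as I scroll down the page. I tried Selenium, but it did not work. https://www.inc.com/profile/onetrust Note: I want to scrape all of the information for each company, but for now I have selected just two fields.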
import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
my_url = 'https://www.inc.com/profile/onetrust'
options = Options()
driver = webdriver.Chrome(options=options)
driver.get(my_url)
time.sleep(3)
page = driver.page_source
driver.quit()
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]
for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()
    print("rank :" + rank)
    print("Company_name :" + Company_name)

I updated the code, but the page does not scroll at all. I also fixed some errors in the BeautifulSoup code.
import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver
my_url = 'https://www.inc.com/profile/onetrust'
driver = webdriver.Chrome()
driver.get(my_url)
def scroll_down(self):
    """A method for scrolling the page."""
    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load the page.
        time.sleep(2)
        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
page_soup = soup(driver.page_source, "html.parser")
containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]
for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()
    print("rank :" + rank)
    print("Company_name :" + Company_name)

Thanks for reading!
Posted on 2020-11-06 11:47:23
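Try the approach below using requests: it is simple, reliable, fast, and needs less code. I found the API URL on the website itself by inspecting the Network tab in Google Chrome's developer tools.
What the script below actually does: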
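As a rough illustration of the requests approach (a sketch, not the answerer's original script): the endpoint URL, the `page` and `limit` query parameters, and the `rank` and `company` JSON keys below are all placeholders, and must be replaced with whatever the Network tab actually shows for this site. The sketch assumes the endpoint returns a JSON list of company records per page and stops when a page comes back empty.

```python
# Hypothetical placeholder: substitute the real API URL seen in the Network tab.
API_URL = "https://example.com/api/companies"


def fetch_page(session, url, page, page_size):
    """Request one page of results and return the decoded JSON.

    The "page" and "limit" parameter names are assumptions for illustration.
    """
    response = session.get(url, params={"page": page, "limit": page_size}, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


def collect_companies(get_page, total):
    """Call get_page(page) for page = 1, 2, ... until `total` records
    have been collected or a page comes back empty."""
    companies = []
    page = 1
    while len(companies) < total:
        batch = get_page(page)
        if not batch:  # no more data available
            break
        companies.extend(batch)
        page += 1
    return companies[:total]


def scrape_all(total=5000, page_size=50):
    """Fetch up to `total` companies from API_URL and print two fields."""
    import requests  # third-party (pip install requests); only needed for real HTTP calls

    with requests.Session() as session:
        companies = collect_companies(
            lambda page: fetch_page(session, API_URL, page, page_size), total
        )
    for company in companies:
        # "rank" and "company" are assumed key names, not confirmed ones.
        print("rank :", company.get("rank"))
        print("Company_name :", company.get("company"))
    return companies
```

Once `API_URL` and the field names match the real response, calling `scrape_all(5000)` runs the whole download. Keeping the paging loop in `collect_companies` separate from the HTTP call makes it possible to try the loop on canned data before hitting the site.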
https://stackoverflow.com/questions/64701137