每当我试图使用selenium和BeautifulSoup刮取BeautifulSoup时,我都无法从单个页面中提取所有数据。
例如,对于一个由50个产品组成的搜索结果,将提取前15个产品的信息,而其余的信息将给出空值。
现在,我知道这与滚轮有关,但我不知道如何使它工作。知道怎么解决这个问题吗?
代码到现在为止
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
import csv
# create object for chrome options
chrome_options = Options()
#base_url = 'https://shopee.sg/search?keyword=disinfectant'
# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
#chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
"profile.default_content_setting_values.notifications": 2
})
def get_url(search_term):
"""Generate an url from the search term"""
template = "https://www.shopee.sg/search?keyword={}"
search_term = search_term.replace(' ','+')
#add term query to url
url = template.format(search_term)
#add page query placeholder
url+= '&page={}'
return url
def main(search_term):
# invoke the webdriver
driver = webdriver.Chrome(options = chrome_options)
item_cost = []
item_name = []
url=get_url(search_term)
for page in range(0,3):
driver.get(url.format(page))
delay = 5 #seconds
try:
WebDriverWait(driver, delay)
print ("Page is ready")
sleep(5)
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#print(html)
soup = BeautifulSoup(html, "html.parser")
#find the product description
for item_n in soup.find_all('div',{'class':'col-xs-2-4 shopee-search-item-result__item'}):
try:
description_soup = item_n.find('div',{'class':'yQmmFK _1POlWt _36CEnF'})
name = description_soup.text.strip()
except AttributeError:
name = ''
print(name)
item_name.append(name)
# find the price of items
for item_c in soup.find_all('div',{'class':'col-xs-2-4 shopee-search-item-result__item'}):
try:
price_soup = item_c.find('div',{'class':'WTFwws _1lK1eK _5W0f35'})
price_final = price_soup.find('span',{'class':'_29R_un'})
price = price_final.text.strip()
except AttributeError:
price = ''
print(price)
item_cost.append(price)
except TimeoutException:
print ("Loading took too much time!-Try again")
sleep(5)
rows = zip(item_name, item_cost)
with open('shopee_item_list.csv','w',newline='',encoding='utf-8') as f:
writer=csv.writer(f)
writer.writerow(['Product Description', 'Price'])
writer.writerows(rows)```
发布于 2021-03-22 17:28:00
问题是,当您向下滚动页面时,您试图抓取的产品会动态加载。可能有比我的更优雅的解决方案,但是我使用driver.execute_script (附加资源:https://www.geeksforgeeks.org/execute_script-driver-method-selenium-python)实现了一个简单的javascript scroller。
滚筒
它滚动到页面高度的十分之一,暂停500毫秒,然后继续。
driver.execute_script("""
var scroll = document.body.scrollHeight / 10;
var i = 0;
function scrollit(i) {
window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
i++;
if (i < 10) {
setTimeout(scrollit, 500, i);
}
}
scrollit(i);
""")
此外,您还有两个for循环,用于item_n in soup.find_all(.),用于item_c in soup.find_all(.)在同一个类中迭代div。我修正了这一点,在我的代码中,这样您就可以得到每个项目的价格和名称,而只使用一个for循环。
您还进行了尝试--除了语句(如果存在AttributeError,即如果您在soup.find_all中找到的项是NoneTypes)。我把这些简化成if语句,就像这个
name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
if name is not None:
name = name.text.strip()
else:
name = ''
最后,您使用zip对两个不同的列表(名称和价格)添加到csv文件中。我将这些单独的列表合并到for循环中的嵌套列表中,而不是附加到两个单独的列表中,并在末尾进行压缩。这节省了一个步骤,尽管它是可选的,可能不是您所需要的。
完整(更新)代码
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import csv
from time import sleep
# create object for chrome options
chrome_options = Options()
# base_url = 'https://shopee.sg/search?keyword=disinfectant'
# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
# chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
"profile.default_content_setting_values.notifications": 2
})
def get_url(search_term):
"""Generate an url from the search term"""
template = "https://www.shopee.sg/search?keyword={}"
search_term = search_term.replace(' ', '+')
# add term query to url
url = template.format(search_term)
# add page query placeholder
url += '&page={}'
return url
def main(search_term):
# invoke the webdriver
driver = webdriver.Chrome(options=chrome_options)
rows = []
url = get_url(search_term)
for page in range(0, 3):
driver.get(url.format(page))
WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "shopee-search-item-result__item")))
driver.execute_script("""
var scroll = document.body.scrollHeight / 10;
var i = 0;
function scrollit(i) {
window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
i++;
if (i < 10) {
setTimeout(scrollit, 500, i);
}
}
scrollit(i);
""")
sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all('div', {'class': 'col-xs-2-4 shopee-search-item-result__item'}):
name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
if name is not None:
name = name.text.strip()
else:
name = ''
price = item.find('div', {'class': 'WTFwws _1lK1eK _5W0f35'})
if price is not None:
price = price.find('span', {'class': '_29R_un'}).text.strip()
else:
price = ''
print([name, price])
rows.append([name, price])
with open('shopee_item_list.csv', 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f)
writer.writerow(['Product Description', 'Price'])
writer.writerows(rows)
https://stackoverflow.com/questions/66747082
复制相似问题