
Web scraping shopee.sg in Python with selenium and BeautifulSoup

Stack Overflow user
Asked on 2021-03-22 13:31:19
1 answer · 1.2K views · 0 followers · 1 vote

Whenever I try to scrape shopee.sg using selenium and BeautifulSoup, I am unable to extract all of the data from a single page.

For example, for a search result consisting of 50 products, the information for the first 15 products is extracted, while the rest come back empty.

Now, I know this has something to do with scrolling, but I can't figure out how to make it work. Any idea how to solve this?

Code so far:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
from bs4 import BeautifulSoup
import csv

# create object for chrome options
chrome_options = Options()
#base_url = 'https://shopee.sg/search?keyword=disinfectant'

# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
#chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", { 
    "profile.default_content_setting_values.notifications": 2
    })


def get_url(search_term):
    """Generate an url from the search term"""
    template = "https://www.shopee.sg/search?keyword={}"
    search_term = search_term.replace(' ','+')
    
    #add term query to url
    url = template.format(search_term)
    
    #add page query placeholder
    url+= '&page={}'
    
    return url
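
# illustrative example (not in the original post): get_url('hand sanitizer')
# would return 'https://www.shopee.sg/search?keyword=hand+sanitizer&page={}'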

def main(search_term):
    # invoke the webdriver
    driver = webdriver.Chrome(options=chrome_options)


    item_cost = []
    item_name = []
    url=get_url(search_term)

    for page in range(0,3):
        driver.get(url.format(page))
        delay = 5 #seconds


        try:
            WebDriverWait(driver, delay)
            print ("Page is ready")
            sleep(5)
            html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
            #print(html)
            soup = BeautifulSoup(html, "html.parser")
            #find the product description
            for item_n in soup.find_all('div',{'class':'col-xs-2-4 shopee-search-item-result__item'}):
                try:
                    description_soup = item_n.find('div',{'class':'yQmmFK _1POlWt _36CEnF'})
                    name = description_soup.text.strip()
                except AttributeError:
                    name = ''
                print(name)    
                item_name.append(name)

            # find the price of items
            for item_c in soup.find_all('div',{'class':'col-xs-2-4 shopee-search-item-result__item'}):
                try:
                    price_soup = item_c.find('div',{'class':'WTFwws _1lK1eK _5W0f35'})
                    price_final = price_soup.find('span',{'class':'_29R_un'})
                    price = price_final.text.strip()
                except AttributeError:
                    price = ''
                print(price)
                item_cost.append(price)
  
        except TimeoutException:
            print ("Loading took too much time!-Try again")
        sleep(5)
    rows = zip(item_name, item_cost)
    
    
    with open('shopee_item_list.csv','w',newline='',encoding='utf-8') as f:
        writer=csv.writer(f)
        writer.writerow(['Product Description', 'Price'])
        writer.writerows(rows)
```

1 Answer

Stack Overflow user

Answered on 2021-03-22 17:28:00

The problem is that the products you are trying to scrape are loaded dynamically as you scroll down the page. There may be more elegant solutions than mine, but I implemented a simple JavaScript scroller using driver.execute_script (additional resource: https://www.geeksforgeeks.org/execute_script-driver-method-selenium-python).

The scroller

It scrolls down by a tenth of the page height at a time, pauses for 500 milliseconds, and then continues until the whole page has been covered.

```python
driver.execute_script("""
    var scroll = document.body.scrollHeight / 10;
    var i = 0;
    function scrollit(i) {
       window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
       i++;
       if (i < 10) {
           setTimeout(scrollit, 500, i);
       }
    }
    scrollit(i);
""")

Also, you had two for loops, `for item_n in soup.find_all(...)` and `for item_c in soup.find_all(...)`, iterating over divs of the same class. I fixed this in my code so that you get each item's price and name using only one for loop, as sketched below.
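
In outline, the merged loop pulls both fields from the same result tile. Here is a runnable mini-example with an illustrative HTML snippet (the product name and price are made-up values; the class names come from the answer's code):

```python
from bs4 import BeautifulSoup

# tiny illustrative document containing a single result tile
html = '''<div class="col-xs-2-4 shopee-search-item-result__item">
  <div class="yQmmFK _1POlWt _36CEnF">Disinfectant spray</div>
  <div class="WTFwws _1lK1eK _5W0f35"><span class="_29R_un">$9.90</span></div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

# one pass over the result tiles; each tile yields both name and price
for item in soup.find_all('div', {'class': 'col-xs-2-4 shopee-search-item-result__item'}):
    name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
    price = item.find('div', {'class': 'WTFwws _1lK1eK _5W0f35'})
    print(name.text, price.find('span', {'class': '_29R_un'}).text)
```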

You also had try-except statements (for the AttributeError raised when an item you find in soup.find_all is a NoneType). I simplified these into if statements, like this:

```python
name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
if name is not None:
    name = name.text.strip()
else:
    name = ''
```

Finally, you used zip to combine two separate lists (names and prices) before writing them to the csv file. Instead of appending to two separate lists and zipping them at the end, I append [name, price] pairs to a single nested list inside the for loop. This saves a step, though it is optional and may not be what you need.
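
Side by side, the two approaches look like this (the product names and prices are illustrative values standing in for scraped data):

```python
item_name, item_cost = ['Dettol spray', 'Lysol wipes'], ['$8.50', '$12.90']

# before: two parallel lists, zipped together at the end
rows_before = list(zip(item_name, item_cost))

# after: one nested list, built row by row while parsing
rows_after = []
for name, price in zip(item_name, item_cost):
    rows_after.append([name, price])
```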

Full (updated) code

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import csv
from time import sleep
# create object for chrome options
chrome_options = Options()
# base_url = 'https://shopee.sg/search?keyword=disinfectant'

# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
# chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})


def get_url(search_term):
    """Generate an url from the search term"""
    template = "https://www.shopee.sg/search?keyword={}"
    search_term = search_term.replace(' ', '+')

    # add term query to url
    url = template.format(search_term)

    # add page query placeholder
    url += '&page={}'

    return url


def main(search_term):
    # invoke the webdriver
    driver = webdriver.Chrome(options=chrome_options)
    rows = []
    url = get_url(search_term)

    for page in range(0, 3):
        driver.get(url.format(page))
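        # explicit wait: give the page up to 20 seconds to render at least one result tile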
        WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "shopee-search-item-result__item")))
        driver.execute_script("""
        var scroll = document.body.scrollHeight / 10;
        var i = 0;
        function scrollit(i) {
           window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
           i++;
           if (i < 10) {
            setTimeout(scrollit, 500, i);
            }
        }
        scrollit(i);
        """)
        sleep(5)
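        # after the scroll, page_source includes the lazily-loaded result tiles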
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', {'class': 'col-xs-2-4 shopee-search-item-result__item'}):
            name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
            if name is not None:
                name = name.text.strip()
            else:
                name = ''

            price = item.find('div', {'class': 'WTFwws _1lK1eK _5W0f35'})
            if price is not None:
                price = price.find('span', {'class': '_29R_un'}).text.strip()
            else:
                price = ''
            print([name, price])
            rows.append([name, price])

    with open('shopee_item_list.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Product Description', 'Price'])
        writer.writerows(rows)
```
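
Neither version actually invokes main; assuming the search term from the commented-out base_url, a call might look like:

```python
if __name__ == '__main__':
    main('disinfectant')  # search term taken from the commented-out base_url
```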
1 vote
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/66747082