
Web scraping shopee.sg in Python with selenium and BeautifulSoup

Stack Overflow user
Asked on 2021-03-22 13:31:19
1 answer · 1.2K views · 0 followers · 1 vote

Whenever I try to scrape shopee.sg using selenium and BeautifulSoup, I am unable to extract all of the data from a single page.

For example, for a search result consisting of 50 products, the information for the first 15 products is extracted, while the rest come back empty.

Now, I know this has something to do with scrolling, but I can't figure out how to make it work. Any idea how to solve this?

Code so far:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep
from bs4 import BeautifulSoup
import csv

# create object for chrome options
chrome_options = Options()
#base_url = 'https://shopee.sg/search?keyword=disinfectant'

# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
#chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", { 
    "profile.default_content_setting_values.notifications": 2
    })


def get_url(search_term):
    """Generate an url from the search term"""
    template = "https://www.shopee.sg/search?keyword={}"
    search_term = search_term.replace(' ','+')
    
    #add term query to url
    url = template.format(search_term)
    
    #add page query placeholder
    url+= '&page={}'
    
    return url
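
# illustrative example (not in the original post): get_url('hand sanitizer')
# would return 'https://www.shopee.sg/search?keyword=hand+sanitizer&page={}'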

def main(search_term):
    # invoke the webdriver
    driver = webdriver.Chrome(options=chrome_options)


    item_cost = []
    item_name = []
    url=get_url(search_term)

    for page in range(0,3):
        driver.get(url.format(page))
        delay = 5 #seconds


        try:
            WebDriverWait(driver, delay)
            print ("Page is ready")
            sleep(5)
            html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
            #print(html)
            soup = BeautifulSoup(html, "html.parser")
            #find the product description
            for item_n in soup.find_all('div',{'class':'col-xs-2-4 shopee-search-item-result__item'}):
                try:
                    description_soup = item_n.find('div',{'class':'yQmmFK _1POlWt _36CEnF'})
                    name = description_soup.text.strip()
                except AttributeError:
                    name = ''
                print(name)    
                item_name.append(name)

            # find the price of items
            for item_c in soup.find_all('div',{'class':'col-xs-2-4 shopee-search-item-result__item'}):
                try:
                    price_soup = item_c.find('div',{'class':'WTFwws _1lK1eK _5W0f35'})
                    price_final = price_soup.find('span',{'class':'_29R_un'})
                    price = price_final.text.strip()
                except AttributeError:
                    price = ''
                print(price)
                item_cost.append(price)
  
        except TimeoutException:
            print ("Loading took too much time!-Try again")
        sleep(5)
    rows = zip(item_name, item_cost)
    
    
    with open('shopee_item_list.csv','w',newline='',encoding='utf-8') as f:
        writer=csv.writer(f)
        writer.writerow(['Product Description', 'Price'])
        writer.writerows(rows)
```

1 Answer

Stack Overflow user

Answered on 2021-03-22 17:28:00

The problem is that the products you are trying to scrape are loaded dynamically as you scroll down the page. There may be more elegant solutions than mine, but I implemented a simple JavaScript scroller using driver.execute_script (additional resource: https://www.geeksforgeeks.org/execute_script-driver-method-selenium-python).

The scroller

It scrolls down by a tenth of the page height at a time, pauses for 500 milliseconds, and then continues until the whole page has been covered.

```python
driver.execute_script("""
    var scroll = document.body.scrollHeight / 10;
    var i = 0;
    function scrollit(i) {
       window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
       i++;
       if (i < 10) {
           setTimeout(scrollit, 500, i);
       }
    }
    scrollit(i);
""")

Also, you had two for loops, `for item_n in soup.find_all(...)` and `for item_c in soup.find_all(...)`, iterating over divs of the same class. I fixed this in my code so that you get each item's price and name using only one for loop, as sketched below.
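
In outline, the merged loop pulls both fields from the same result tile. Here is a runnable mini-example with an illustrative HTML snippet (the product name and price are made-up values; the class names come from the answer's code):

```python
from bs4 import BeautifulSoup

# tiny illustrative document containing a single result tile
html = '''<div class="col-xs-2-4 shopee-search-item-result__item">
  <div class="yQmmFK _1POlWt _36CEnF">Disinfectant spray</div>
  <div class="WTFwws _1lK1eK _5W0f35"><span class="_29R_un">$9.90</span></div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

# one pass over the result tiles; each tile yields both name and price
for item in soup.find_all('div', {'class': 'col-xs-2-4 shopee-search-item-result__item'}):
    name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
    price = item.find('div', {'class': 'WTFwws _1lK1eK _5W0f35'})
    print(name.text, price.find('span', {'class': '_29R_un'}).text)
```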

You also had try-except statements (for the AttributeError raised when an item you find in soup.find_all is a NoneType). I simplified these into if statements, like this:

```python
name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
if name is not None:
    name = name.text.strip()
else:
    name = ''
```

Finally, you used zip to combine two separate lists (names and prices) before writing them to the csv file. Instead of appending to two separate lists and zipping them at the end, I append [name, price] pairs to a single nested list inside the for loop. This saves a step, though it is optional and may not be what you need.
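
Side by side, the two approaches look like this (the product names and prices are illustrative values standing in for scraped data):

```python
item_name, item_cost = ['Dettol spray', 'Lysol wipes'], ['$8.50', '$12.90']

# before: two parallel lists, zipped together at the end
rows_before = list(zip(item_name, item_cost))

# after: one nested list, built row by row while parsing
rows_after = []
for name, price in zip(item_name, item_cost):
    rows_after.append([name, price])
```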

Full (updated) code

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import csv
from time import sleep
# create object for chrome options
chrome_options = Options()
# base_url = 'https://shopee.sg/search?keyword=disinfectant'

# set chrome driver options to disable any popup's from the website
# to find local path for chrome profile, open chrome browser
# and in the address bar type, "chrome://version"
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
# chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message, "Chrome is being controlled by automated test software"
chrome_options.add_argument("disable-infobars")
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})


def get_url(search_term):
    """Generate an url from the search term"""
    template = "https://www.shopee.sg/search?keyword={}"
    search_term = search_term.replace(' ', '+')

    # add term query to url
    url = template.format(search_term)

    # add page query placeholder
    url += '&page={}'

    return url


def main(search_term):
    # invoke the webdriver
    driver = webdriver.Chrome(options=chrome_options)
    rows = []
    url = get_url(search_term)

    for page in range(0, 3):
        driver.get(url.format(page))
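        # explicit wait: give the page up to 20 seconds to render at least one result tile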
        WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "shopee-search-item-result__item")))
        driver.execute_script("""
        var scroll = document.body.scrollHeight / 10;
        var i = 0;
        function scrollit(i) {
           window.scrollBy({top: scroll, left: 0, behavior: 'smooth'});
           i++;
           if (i < 10) {
            setTimeout(scrollit, 500, i);
            }
        }
        scrollit(i);
        """)
        sleep(5)
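        # after the scroll, page_source includes the lazily-loaded result tiles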
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', {'class': 'col-xs-2-4 shopee-search-item-result__item'}):
            name = item.find('div', {'class': 'yQmmFK _1POlWt _36CEnF'})
            if name is not None:
                name = name.text.strip()
            else:
                name = ''

            price = item.find('div', {'class': 'WTFwws _1lK1eK _5W0f35'})
            if price is not None:
                price = price.find('span', {'class': '_29R_un'}).text.strip()
            else:
                price = ''
            print([name, price])
            rows.append([name, price])

    with open('shopee_item_list.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Product Description', 'Price'])
        writer.writerows(rows)
```
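
Neither version actually invokes main; assuming the search term from the commented-out base_url, a call might look like:

```python
if __name__ == '__main__':
    main('disinfectant')  # search term taken from the commented-out base_url
```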
1 vote
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/66747082