I'm not familiar with scraping yet. I've been asked to pull a list of store numbers, cities, and states from the website https://www.lowes.com/Lowes-Stores
Below is what I have tried so far. Since the elements don't have obvious attributes to hook into, I'm not sure how to proceed with my code. Please advise!
import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df
url = "https://www.lowes.com/Lowes-Stores"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
page = requests.get(url, headers=headers)
page.encoding = 'ISO-8859-1'
soup = BeautifulSoup(page.text, 'html.parser')
lowes_list = soup.find_all(class_ = "list unstyled")
for i in lowes_list[:2]:
    print(i)
example = lowes_list[0]
example_content = example.contents
example_content

Posted on 2020-08-20 07:29:43
In your for loop you have already found the list elements that contain the links you need for the state store lookups. You need to get the href attribute from the "a" tag inside each "li" element.
That is only the first step, because you then have to use those links to fetch each state's store results.
Since you know the structure of these state link results, you can simply do the following:
for i in lowes_list:
    list_items = i.find_all('li')
    for x in list_items:
        for link in x.find_all('a'):
            print(link['href'])

There are certainly more efficient ways to do this, but the list is quite small and this works.
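As noted above, there are more compact ways to collect those hrefs. One option is a single CSS selector via BeautifulSoup's `select`, which replaces the three nested loops. A minimal sketch, using a small stand-in snippet that assumes the same `list unstyled` markup as the live page:

```python
from bs4 import BeautifulSoup

# stand-in for the state-links markup (assumed structure, not fetched from the site)
html = """
<ul class="list unstyled">
  <li><a href="/Lowes-Stores/Alaska/AK">Alaska</a></li>
  <li><a href="/Lowes-Stores/Alabama/AL">Alabama</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# one CSS selector does the work of the nested find_all loops
links = [a["href"] for a in soup.select("ul.list.unstyled li a")]
print(links)  # ['/Lowes-Stores/Alaska/AK', '/Lowes-Stores/Alabama/AL']
```

The selector `ul.list.unstyled li a` matches every anchor inside a list item of a `<ul>` carrying both classes, so it yields the same hrefs as the loop version.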
Once you have the links for each state, you can make another request per state to visit its store results page, then grab the href attribute from each search result link on that page. A link such as
<a href="/store/AK-Anchorage/0289">Anchorage Lowe's</a> contains both the city and the store number.
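The href format shown above can be parsed with plain string operations, no network access needed. A small sketch (the helper name `parse_store_href` is my own, not from the site or the answer below):

```python
def parse_store_href(href):
    """Parse a store href of the form '/store/AK-Anchorage/0289'."""
    parts = href.split("/")              # ['', 'store', 'AK-Anchorage', '0289']
    state_abbr, city = parts[2].split("-", 1)
    store_number = parts[3]
    return {"state": state_abbr, "city": city, "store_number": store_number}

print(parse_store_href("/store/AK-Anchorage/0289"))
# {'state': 'AK', 'city': 'Anchorage', 'store_number': '0289'}
```

Note the `split("-", 1)` with a maxsplit of 1: it keeps hyphenated city names such as "Winston-Salem" intact, which a bare `split("-")[1]` would truncate.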
Here is a complete example. I've included plenty of comments to illustrate the points.
By line 27 you almost had everything, but you needed to follow each state's link. A good way to work through a problem like this is to first test the paths in a web browser with the developer tools open and inspect the HTML, so you have a clear idea of where to start before writing code.
This script fetches the data you need, but it does nothing to present it.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.lowes.com/Lowes-Stores"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
page = requests.get(url, headers=headers, timeout=5)
page.encoding = "ISO-8859-1"
soup = bs(page.text, "html.parser")

lowes_state_lists = soup.find_all(class_="list unstyled")

# we will store the links for each state in this list
state_stores_links = []

# now we populate the state_stores_links list by finding the href in each li tag
for ul in lowes_state_lists:
    list_items = ul.find_all("li")
    # now we have all the list items from the page; we have to extract the href
    for li in list_items:
        for link in li.find_all("a"):
            state_stores_links.append(link["href"])

# This next part is what the original question was missing: following the state
# links to their respective search result pages.
# At this point we have to request a new page for each state and store the results.
# You could use pandas, but a dict works too.
states_stores = {}
for link in state_stores_links:
    # splitting the link on "/" gives us the parts of the URL.
    # by inspecting with Chrome DevTools, we can see that each state follows
    # the same pattern (state name and state abbreviation)
    link_components = link.split("/")
    state_name = link_components[2]
    state_abbreviation = link_components[3]
    # let's use the state_abbreviation as the dict's key, with a stores list we can report on.
    # The type and shape of this dict is irrelevant at this point; this example
    # illustrates how to obtain the info you're after. In the end, the
    # states_stores[state_abbreviation]["stores"] list will hold dicts,
    # each with a store_number and a city key.
    states_stores[state_abbreviation] = {"state_name": state_name, "stores": []}
    try:
        # simple error catching in case something goes wrong, since we are sending many requests.
        # our link is just the second half of the URL, so we have to craft the full one
        new_link = "https://www.lowes.com" + link
        state_search_results = requests.get(new_link, headers=headers, timeout=5)
        if state_search_results.status_code == 200:
            store_directory = bs(state_search_results.content, "html.parser")
            store_directory_div = store_directory.find("div", class_="storedirectory")
            # now we get the links inside the storedirectory div
            individual_store_links = store_directory_div.find_all("a")
            # we now have all the stores for this state! Let's parse and save them into our store dict.
            # The store's city comes after the state abbreviation and a dash;
            # the store number is the last part of the link.
            # example: "/store/AK-Wasilla/2512"
            for store in individual_store_links:
                href = store["href"]
                try:
                    # the href format is consistent throughout the site,
                    # so splitting it gives us the info we need
                    split_href = href.split("/")
                    store_number = split_href[3]
                    # the city comes after the "-", so we split that element
                    # into its two parts and take the second
                    store_city = split_href[2].split("-")[1]
                    # create the store dict and add it to our state's dict
                    store_object = {"city": store_city, "store_number": store_number}
                    states_stores[state_abbreviation]["stores"].append(store_object)
                except Exception as e:
                    print(
                        "Error getting store info from {0}. Exception: {1}".format(
                            split_href, e
                        )
                    )
            # print something so we can confirm the script is working
            print(
                "State store count for {0} is: {1}".format(
                    states_stores[state_abbreviation]["state_name"],
                    len(states_stores[state_abbreviation]["stores"]),
                )
            )
        else:
            print(
                "Error fetching: {0}, error code: {1}".format(
                    link, state_search_results.status_code
                )
            )
    except Exception as e:
        print("Error fetching: {0}. Exception: {1}".format(state_abbreviation, e))

Source: https://stackoverflow.com/questions/63496151
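Since the script stops at a nested dict and the question already imports pandas, one possible presentation step is flattening `states_stores` into a DataFrame. A sketch, using a tiny hand-built sample in the same shape the script produces (the sample values are illustrative, not real scraped data):

```python
import pandas as pd

# sample in the shape the script above builds (assumed, not fetched)
states_stores = {
    "AK": {
        "state_name": "Alaska",
        "stores": [
            {"city": "Anchorage", "store_number": "0289"},
            {"city": "Wasilla", "store_number": "2512"},
        ],
    },
}

# flatten the nested dict into one row per store
rows = [
    {"state": abbr, "city": s["city"], "store_number": s["store_number"]}
    for abbr, data in states_stores.items()
    for s in data["stores"]
]
df = pd.DataFrame(rows, columns=["store_number", "city", "state"])
print(df)
```

From here `df.to_csv(...)` or a `groupby("state")` count would give the kind of list the question asked for.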