I'm not familiar with scraping yet. I've been asked to pull a list of store numbers, cities, and states from the website https://www.lowes.com/Lowes-Stores
Below is what I have tried so far. Since the elements don't have obvious attributes to hook into, I'm not sure how to proceed with my code. Please advise!
import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df
url = "https://www.lowes.com/Lowes-Stores"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
page = requests.get(url, headers=headers)
page.encoding = 'ISO-8859-1'
soup = BeautifulSoup(page.text, 'html.parser')
lowes_list = soup.find_all(class_ = "list unstyled")
for i in lowes_list[:2]:
    print(i)
example = lowes_list[0]
example_content = example.contents
example_content

Posted on 2020-08-20 07:29:43
In your for loop you have already found the list elements that contain the links you need for the state store lookups. You need to get the href attribute from the "a" tag inside each "li" element.
That is only the first step, because you then have to use those links to fetch each state's store results.
Since you know the structure of these state link results, you can simply do the following:
for i in lowes_list:
    list_items = i.find_all('li')
    for x in list_items:
        for link in x.find_all('a'):
            print(link['href'])

There are certainly more efficient ways to do this, but the list is quite small and this works.
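As noted above, there are more compact ways to collect those hrefs. One option is a single CSS selector via BeautifulSoup's `select`, which replaces the three nested loops. A minimal sketch, using a small stand-in snippet that assumes the same `list unstyled` markup as the live page:

```python
from bs4 import BeautifulSoup

# stand-in for the state-links markup (assumed structure, not fetched from the site)
html = """
<ul class="list unstyled">
  <li><a href="/Lowes-Stores/Alaska/AK">Alaska</a></li>
  <li><a href="/Lowes-Stores/Alabama/AL">Alabama</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# one CSS selector does the work of the nested find_all loops
links = [a["href"] for a in soup.select("ul.list.unstyled li a")]
print(links)  # ['/Lowes-Stores/Alaska/AK', '/Lowes-Stores/Alabama/AL']
```

The selector `ul.list.unstyled li a` matches every anchor inside a list item of a `<ul>` carrying both classes, so it yields the same hrefs as the loop version.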
Once you have the links for each state, you can make another request per state to visit its store results page, then grab the href attribute from each search result link on that page. A link such as
<a href="/store/AK-Anchorage/0289">Anchorage Lowe's</a> contains both the city and the store number.
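The href format shown above can be parsed with plain string operations, no network access needed. A small sketch (the helper name `parse_store_href` is my own, not from the site or the answer below):

```python
def parse_store_href(href):
    """Parse a store href of the form '/store/AK-Anchorage/0289'."""
    parts = href.split("/")              # ['', 'store', 'AK-Anchorage', '0289']
    state_abbr, city = parts[2].split("-", 1)
    store_number = parts[3]
    return {"state": state_abbr, "city": city, "store_number": store_number}

print(parse_store_href("/store/AK-Anchorage/0289"))
# {'state': 'AK', 'city': 'Anchorage', 'store_number': '0289'}
```

Note the `split("-", 1)` with a maxsplit of 1: it keeps hyphenated city names such as "Winston-Salem" intact, which a bare `split("-")[1]` would truncate.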
Here is a complete example. I've included plenty of comments to illustrate the points.
By line 27 you almost had everything, but you needed to follow each state's link. A good way to work through a problem like this is to first test the paths in a web browser with the developer tools open and inspect the HTML, so you have a clear idea of where to start before writing code.
This script fetches the data you need, but it does nothing to present it.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.lowes.com/Lowes-Stores"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
page = requests.get(url, headers=headers, timeout=5)
page.encoding = "ISO-8859-1"
soup = bs(page.text, "html.parser")

lowes_state_lists = soup.find_all(class_="list unstyled")

# we will store the links for each state in this list
state_stores_links = []

# now we populate the state_stores_links list by finding the href in each li tag
for ul in lowes_state_lists:
    list_items = ul.find_all("li")
    # now we have all the list items from the page; we have to extract the href
    for li in list_items:
        for link in li.find_all("a"):
            state_stores_links.append(link["href"])

# This next part is what the original question was missing: following the state
# links to their respective search result pages.
# At this point we have to request a new page for each state and store the results.
# You could use pandas, but a dict works too.
states_stores = {}
for link in state_stores_links:
    # splitting the link on "/" gives us the parts of the URL.
    # by inspecting with Chrome DevTools, we can see that each state follows
    # the same pattern (state name and state abbreviation)
    link_components = link.split("/")
    state_name = link_components[2]
    state_abbreviation = link_components[3]
    # let's use the state_abbreviation as the dict's key, with a stores list we can report on.
    # The type and shape of this dict is irrelevant at this point; this example
    # illustrates how to obtain the info you're after. In the end, the
    # states_stores[state_abbreviation]["stores"] list will hold dicts,
    # each with a store_number and a city key.
    states_stores[state_abbreviation] = {"state_name": state_name, "stores": []}
    try:
        # simple error catching in case something goes wrong, since we are sending many requests.
        # our link is just the second half of the URL, so we have to craft the full one
        new_link = "https://www.lowes.com" + link
        state_search_results = requests.get(new_link, headers=headers, timeout=5)
        if state_search_results.status_code == 200:
            store_directory = bs(state_search_results.content, "html.parser")
            store_directory_div = store_directory.find("div", class_="storedirectory")
            # now we get the links inside the storedirectory div
            individual_store_links = store_directory_div.find_all("a")
            # we now have all the stores for this state! Let's parse and save them into our store dict.
            # The store's city comes after the state abbreviation and a dash;
            # the store number is the last part of the link.
            # example: "/store/AK-Wasilla/2512"
            for store in individual_store_links:
                href = store["href"]
                try:
                    # the href format is consistent throughout the site,
                    # so splitting it gives us the info we need
                    split_href = href.split("/")
                    store_number = split_href[3]
                    # the city comes after the "-", so we split that element
                    # into its two parts and take the second
                    store_city = split_href[2].split("-")[1]
                    # create the store dict and add it to our state's dict
                    store_object = {"city": store_city, "store_number": store_number}
                    states_stores[state_abbreviation]["stores"].append(store_object)
                except Exception as e:
                    print(
                        "Error getting store info from {0}. Exception: {1}".format(
                            split_href, e
                        )
                    )
            # print something so we can confirm the script is working
            print(
                "State store count for {0} is: {1}".format(
                    states_stores[state_abbreviation]["state_name"],
                    len(states_stores[state_abbreviation]["stores"]),
                )
            )
        else:
            print(
                "Error fetching: {0}, error code: {1}".format(
                    link, state_search_results.status_code
                )
            )
    except Exception as e:
        print("Error fetching: {0}. Exception: {1}".format(state_abbreviation, e))

Source: https://stackoverflow.com/questions/63496151
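Since the script stops at a nested dict and the question already imports pandas, one possible presentation step is flattening `states_stores` into a DataFrame. A sketch, using a tiny hand-built sample in the same shape the script produces (the sample values are illustrative, not real scraped data):

```python
import pandas as pd

# sample in the shape the script above builds (assumed, not fetched)
states_stores = {
    "AK": {
        "state_name": "Alaska",
        "stores": [
            {"city": "Anchorage", "store_number": "0289"},
            {"city": "Wasilla", "store_number": "2512"},
        ],
    },
}

# flatten the nested dict into one row per store
rows = [
    {"state": abbr, "city": s["city"], "store_number": s["store_number"]}
    for abbr, data in states_stores.items()
    for s in data["stores"]
]
df = pd.DataFrame(rows, columns=["store_number", "city", "state"])
print(df)
```

From here `df.to_csv(...)` or a `groupby("state")` count would give the kind of list the question asked for.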