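For my master's thesis I need to collect data. I want to collect it from Vivino.com, but I have no experience with web scraping. I have seen a few questions on this topic, but I want to collect all the information about each wine (name, country, rating, description, price, etc.), as well as the reviews of the wines.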
import requests
import pandas as pd

# Query the explore endpoint for one page of wines matching the filters.
r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params={
        "country_code": "FR",
        "country_codes[]": "pt",
        "currency_code": "EUR",
        "grape_filter": "varietal",
        "min_rating": "1",
        "order_by": "price",
        "order": "asc",
        "page": 1,
        "price_range_max": "500",
        "price_range_min": "0",
        "wine_type_ids[]": "1",
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    },
)

# One tuple per match: winery name, wine name plus year, average rating, rating count.
results = [
    (
        t["vintage"]["wine"]["winery"]["name"],
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"],
    )
    for t in r.json()["explore_vintage"]["matches"]
]

dataframe = pd.DataFrame(results, columns=["Winery", "Wine", "Rating", "num_review"])
print(dataframe)
With this code I can collect the columns "Winery", "Wine", "Rating" and "num_review".
With the code below, I can collect the reviews:
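The list comprehension above indexes several levels deep into each match, so a single entry that lacks one of those keys would raise a KeyError. A minimal sketch of a defensive lookup follows, assuming that some entries may be incomplete. The helper name `dig`, the sample record, and the "region" key are hypothetical and only illustrate the fallback; they are not taken from the Vivino response.

```python
from typing import Any


def dig(record: dict, *keys: str, default: Any = None) -> Any:
    """Follow keys into nested dicts; return default if any level is missing."""
    current: Any = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hand-written sample shaped like one entry of "matches"; not real Vivino data.
sample_match = {
    "vintage": {
        "year": 2017,
        "wine": {"name": "Example Rouge", "winery": {"name": "Example Winery"}},
        "statistics": {"ratings_average": 4.1, "ratings_count": 120},
    }
}

winery = dig(sample_match, "vintage", "wine", "winery", "name")
rating = dig(sample_match, "vintage", "statistics", "ratings_average")
# "region" is a hypothetical key, used only to show the fallback value.
region = dig(sample_match, "vintage", "wine", "region", "name", default="unknown")
print(winery, rating, region)
```

The same helper could be used to try additional fields after inspecting the JSON with print(json.dumps(r.json(), indent=4)) to see which keys are actually present.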
import re
import json
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}

url = "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
api_url = (
    "https://www.vivino.com/api/wines/{id}/reviews?per_page=9999&year={year}"
)  # <-- increased the number of reviews to 9999

# Extract the wine id and the vintage year from the page URL.
id_ = re.search(r"/(\d{5,})", url).group(1)
year = re.search(r"year=(\d+)", url).group(1)

data = requests.get(api_url.format(id=id_, year=year), headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for r in data["reviews"]:
    print(r["note"])
    print("-" * 80)
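The two regular expressions above fail with an AttributeError if the URL has no id or no year parameter. A small sketch that wraps them in one function with explicit handling is shown here. The function name `parse_wine_url` is hypothetical, and the pattern assumes the id always follows "/w/" as it does in the URL above.

```python
import re
from typing import Optional, Tuple


def parse_wine_url(url: str) -> Tuple[str, Optional[str]]:
    """Return (wine_id, year) from a wine page URL; year is None if absent."""
    # Assumption: the numeric wine id directly follows "/w/" in the path.
    id_match = re.search(r"/w/(\d+)", url)
    if id_match is None:
        raise ValueError(f"no wine id found in {url!r}")
    year_match = re.search(r"[?&]year=(\d+)", url)
    year = year_match.group(1) if year_match else None
    return id_match.group(1), year


url = "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
print(parse_wine_url(url))  # ('3823873', '2017')
```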
Can anyone help me combine these, so that I get all the wine information together with the corresponding reviews?
Thanks in advance!
Posted on 2021-09-22 16:28:59
To get all reviews for the wines in the first dataframe, you can use the following example:
import requests
import pandas as pd


def get_wine_data(wine_id, year, page):
    """Fetch one page (up to 50 reviews) for the given wine id and year."""
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
    }

    # 50 reviews per request; the page parameter selects which page to load
    api_url = "https://www.vivino.com/api/wines/{id}/reviews?per_page=50&year={year}&page={page}"

    data = requests.get(
        api_url.format(id=wine_id, year=year, page=page), headers=headers
    ).json()

    return data


r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params={
        "country_code": "FR",
        "country_codes[]": "pt",
        "currency_code": "EUR",
        "grape_filter": "varietal",
        "min_rating": "1",
        "order_by": "price",
        "order": "asc",
        "page": 1,
        "price_range_max": "500",
        "price_range_min": "0",
        "wine_type_ids[]": "1",
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    },
)

# Same as in the question, plus the year and wine id needed for the reviews API.
results = [
    (
        t["vintage"]["wine"]["winery"]["name"],
        t["vintage"]["year"],
        t["vintage"]["wine"]["id"],
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"],
    )
    for t in r.json()["explore_vintage"]["matches"]
]

dataframe = pd.DataFrame(
    results,
    columns=["Winery", "Year", "Wine ID", "Wine", "Rating", "num_review"],
)

ratings = []
for _, row in dataframe.iterrows():
    page = 1
    # Request page after page until the API returns no more reviews.
    while True:
        print(
            f'Getting info about wine {row["Wine ID"]}-{row["Year"]} Page {page}'
        )

        d = get_wine_data(row["Wine ID"], row["Year"], page)

        if not d["reviews"]:
            break

        for review in d["reviews"]:
            ratings.append(
                [
                    row["Year"],
                    row["Wine ID"],
                    review["rating"],
                    review["note"],
                    review["created_at"],
                ]
            )

        page += 1

ratings = pd.DataFrame(
    ratings, columns=["Year", "Wine ID", "User Rating", "Note", "CreatedAt"]
)

# merge() joins on the columns both frames share: "Year" and "Wine ID".
df_out = ratings.merge(dataframe)
df_out.to_csv("data.csv", index=False)
This creates data.csv (about 40k reviews).
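Collecting this many reviews means several hundred requests, so it may be worth capping the number of pages and pausing between requests. The sketch below shows one way to structure the paging loop as a generator. The names `iter_reviews`, `max_pages`, `delay_seconds` and the fake pages are hypothetical; the fetch function is passed in so the loop can be tried without network access.

```python
import time
from typing import Callable, Dict, Iterator, List


def iter_reviews(
    fetch_page: Callable[[int], Dict],
    max_pages: int = 1000,
    delay_seconds: float = 0.0,
) -> Iterator[Dict]:
    """Yield reviews page by page until an empty page or max_pages is reached."""
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        reviews = data.get("reviews") or []
        if not reviews:
            return
        yield from reviews
        # Pause before the next request to avoid hammering the server.
        time.sleep(delay_seconds)


# Stand-in for get_wine_data: two fake pages, then an empty one.
fake_pages = {
    1: {"reviews": [{"rating": 4.0, "note": "a"}, {"rating": 3.5, "note": "b"}]},
    2: {"reviews": [{"rating": 5.0, "note": "c"}]},
}


def fake_fetch(page: int) -> Dict:
    return fake_pages.get(page, {"reviews": []})


collected: List[Dict] = list(iter_reviews(fake_fetch))
print(len(collected))  # 3
```

With the function from the answer, the call could look like iter_reviews(lambda p: get_wine_data(wine_id, year, p), delay_seconds=1.0), where wine_id and year come from a row of the dataframe.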
https://stackoverflow.com/questions/69287274