Scraping data from Vivino.com - wine information and reviews

Stack Overflow user
Asked 2021-09-22 15:31:55
1 answer · 753 views · 0 followers · 1 vote

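I need to collect data for my master's thesis. I want to collect it from Vivino.com, but I have no experience with web scraping. I have seen a few questions on this topic, but I want to collect all the information about each wine (name, country, rating, description, price, and so on), as well as the reviews of each wine.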

import requests
import pandas as pd

r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params = {
        "country_code": "FR",
        "country_codes[]":"pt",
        "currency_code":"EUR",
        "grape_filter":"varietal",
        "min_rating":"1",
        "order_by":"price",
        "order":"asc",
        "page": 1,
        "price_range_max":"500",
        "price_range_min":"0",
        "wine_type_ids[]":"1"
    },
    headers= {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    }
)
results = [
    (
        t["vintage"]["wine"]["winery"]["name"], 
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"]
    )
    for t in r.json()["explore_vintage"]["matches"]
]
dataframe = pd.DataFrame(results,columns=['Winery','Wine','Rating','num_review'])

print(dataframe)

With this code I can collect "Winery", "Wine", "Rating", and "num_review".
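
Because the goal is to collect more fields than these four, it may help to describe each column as a path into the JSON rather than as a chain of subscripts, so that a missing key yields `None` instead of a `KeyError`. The following is a minimal sketch under stated assumptions: the key paths are the ones used in the code above, the helper names `dig` and `flatten_match` are hypothetical, and the sample record contains made-up values purely to show the shape. Paths for country, price, or description are not shown because they would have to be confirmed by inspecting a real response first.

```python
def dig(record, path, default=None):
    """Follow a sequence of keys through nested dicts.

    Returns `default` as soon as a key is missing or a non-dict is reached.
    """
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Key paths copied from the question's code. Additional columns can be
# added here once their location in the JSON has been verified.
FIELDS = {
    "Winery": ("vintage", "wine", "winery", "name"),
    "Wine": ("vintage", "wine", "name"),
    "Year": ("vintage", "year"),
    "Rating": ("vintage", "statistics", "ratings_average"),
    "num_review": ("vintage", "statistics", "ratings_count"),
}


def flatten_match(match, fields=FIELDS):
    """Turn one entry of explore_vintage.matches into a flat dict."""
    return {column: dig(match, path) for column, path in fields.items()}


# Made-up record for illustration only; it is not real Vivino data.
sample = {
    "vintage": {
        "year": 2017,
        "wine": {"name": "Example Wine", "winery": {"name": "Example Winery"}},
        "statistics": {"ratings_average": 3.5},  # ratings_count left out on purpose
    }
}

row = flatten_match(sample)
print(row)
```

A list of such dicts can be passed straight to `pd.DataFrame(...)`, which takes the column names from the dict keys.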

With the code below I can collect the reviews:

import re
import json
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}


url = "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
api_url = (
    "https://www.vivino.com/api/wines/{id}/reviews?per_page=9999&year={year}"
) # <-- increased the number of reviews to 9999

id_ = re.search(r"/(\d{5,})", url).group(1)
year = re.search(r"year=(\d+)", url).group(1)

data = requests.get(api_url.format(id=id_, year=year), headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for r in data["reviews"]:
    print(r["note"])
    print("-" * 80)
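
The two regular expressions above return `None` when the URL does not have the expected shape, in which case `.group(1)` raises an `AttributeError`. As a hedged alternative, the sketch below pulls the same two values out with `urllib.parse` from the standard library. The function name `parse_wine_url` is hypothetical, and it assumes (based only on the example URL above) that the wine id is the path segment directly after `w`.

```python
from urllib.parse import parse_qs, urlparse


def parse_wine_url(url):
    """Return (wine_id, year) as strings, or raise ValueError."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]

    # Assumption: the id is the segment that follows "w" in the path.
    try:
        wine_id = segments[segments.index("w") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no wine id found in {url!r}") from None
    if not wine_id.isdigit():
        raise ValueError(f"unexpected wine id {wine_id!r} in {url!r}")

    # parse_qs maps each query parameter to a list of values.
    years = parse_qs(parts.query).get("year")
    if not years:
        raise ValueError(f"no year parameter in {url!r}")
    return wine_id, years[0]


url = "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
wine_id, year = parse_wine_url(url)
print(wine_id, year)  # 3823873 2017
```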

Can anyone help me combine these, so that I get all the wine information together with the corresponding reviews?

Thank you in advance!


1 Answer

Stack Overflow user

Answered 2021-09-22 16:28:59

To get all the reviews for the wines in the first dataframe, you can use the example below:

import requests
import pandas as pd


def get_wine_data(wine_id, year, page):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
    }

    # 50 reviews per request; the caller walks through the pages one by one
    api_url = "https://www.vivino.com/api/wines/{id}/reviews?per_page=50&year={year}&page={page}"

    data = requests.get(
        api_url.format(id=wine_id, year=year, page=page), headers=headers
    ).json()

    return data


r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params={
        "country_code": "FR",
        "country_codes[]": "pt",
        "currency_code": "EUR",
        "grape_filter": "varietal",
        "min_rating": "1",
        "order_by": "price",
        "order": "asc",
        "page": 1,
        "price_range_max": "500",
        "price_range_min": "0",
        "wine_type_ids[]": "1",
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    },
)

results = [
    (
        t["vintage"]["wine"]["winery"]["name"],
        t["vintage"]["year"],
        t["vintage"]["wine"]["id"],
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"],
    )
    for t in r.json()["explore_vintage"]["matches"]
]
dataframe = pd.DataFrame(
    results,
    columns=["Winery", "Year", "Wine ID", "Wine", "Rating", "num_review"],
)

ratings = []
for _, row in dataframe.iterrows():
    page = 1
    while True:
        print(
            f'Getting info about wine {row["Wine ID"]}-{row["Year"]} Page {page}'
        )

        d = get_wine_data(row["Wine ID"], row["Year"], page)

        if not d["reviews"]:
            break

        for review in d["reviews"]:
            ratings.append(
                [
                    row["Year"],
                    row["Wine ID"],
                    review["rating"],
                    review["note"],
                    review["created_at"],
                ]
            )

        page += 1

ratings = pd.DataFrame(
    ratings, columns=["Year", "Wine ID", "User Rating", "Note", "CreatedAt"]
)

# Attach the wine-level columns to every review via the shared key columns
df_out = ratings.merge(dataframe, on=["Year", "Wine ID"])
df_out.to_csv("data.csv", index=False)

This creates data.csv (about 40k reviews). The original answer showed a screenshot of the file opened in LibreOffice.
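
The core of the answer is the "request pages until one comes back empty" loop. A minimal sketch of that pattern in isolation, so it can be tried without touching the network, might look like the following. The names `collect_reviews`, `fake_fetch`, and `fake_pages` are hypothetical, as are the `delay` and `max_pages` parameters, which are only suggestions for pausing between requests and capping a run. The stand-in data is made up.

```python
import time


def collect_reviews(fetch_page, delay=0.0, max_pages=None):
    """Call fetch_page(1), fetch_page(2), ... until a page has no reviews.

    fetch_page must return a dict with a "reviews" list, like the JSON
    returned by get_wine_data in the answer.
    """
    reviews = []
    page = 1
    while max_pages is None or page <= max_pages:
        data = fetch_page(page)
        batch = data.get("reviews") or []
        if not batch:
            break  # an empty page marks the end
        reviews.extend(batch)
        page += 1
        if delay:
            time.sleep(delay)  # optional pause between requests
    return reviews


# Stand-in for the real API: two pages of made-up notes, then nothing.
fake_pages = {1: [{"note": "a"}, {"note": "b"}], 2: [{"note": "c"}]}


def fake_fetch(page):
    return {"reviews": fake_pages.get(page, [])}


all_reviews = collect_reviews(fake_fetch)
print(len(all_reviews))  # 3
```

With the answer's function, the call would be along the lines of `collect_reviews(lambda p: get_wine_data(wine_id, year, p), delay=1.0)`.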

Votes: 2
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/69287274
