文章/答案/技术大牛

发布

社区首页 >问答首页 >从Vivino.com中抓取数据-葡萄酒信息和评论

问从Vivino.com中抓取数据-葡萄酒信息和评论
EN

Stack Overflow用户

提问于 2021-09-22 15:31:55

回答 1查看 753关注 0票数 1

为了写我的硕士论文，我需要收集数据。现在，我想从Vivino.com上收集数据，但我没有任何网络抓取的经验。我已经看到了一些关于这方面的问题，但我想收集所有关于葡萄酒的信息(名称，国家，评级，描述，价格等)。还有对葡萄酒的评论。

import requests
import pandas as pd

r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params = {
        "country_code": "FR",
        "country_codes[]":"pt",
        "currency_code":"EUR",
        "grape_filter":"varietal",
        "min_rating":"1",
        "order_by":"price",
        "order":"asc",
        "page": 1,
        "price_range_max":"500",
        "price_range_min":"0",
        "wine_type_ids[]":"1"
    },
    headers= {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    }
)
results = [
    (
        t["vintage"]["wine"]["winery"]["name"], 
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"]
    )
    for t in r.json()["explore_vintage"]["matches"]
]
dataframe = pd.DataFrame(results,columns=['Winery','Wine','Rating','num_review'])

print(dataframe)

使用这段代码，我可以收集“葡萄酒厂”“葡萄酒”“评级”“num_review”。

使用下面的代码，我可以收集评论：

import re
import json
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}


url = "https://www.vivino.com/FR/en/dauprat-pauillac/w/3823873?year=2017&price_id=24797287"
api_url = (
    "https://www.vivino.com/api/wines/{id}/reviews?per_page=9999&year={year}"
) # <-- increased the number of reviews to 9999

id_ = re.search(r"/(\d{5,})", url).group(1)
year = re.search(r"year=(\d+)", url).group(1)

data = requests.get(api_url.format(id=id_, year=year), headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for r in data["reviews"]:
    print(r["note"])
    print("-" * 80)

有没有人能帮我把这些信息组合起来？那么，所有的葡萄酒信息，包括相应的评论？

提前谢谢你！！

pandas

web-scraping

web-scraping-language

python

回答 1

Stack Overflow用户

发布于 2021-09-22 16:28:59

要从第一个数据帧中获取关于葡萄酒的所有评论，可以使用下面的示例：

import requests
import pandas as pd


def get_wine_data(wine_id, year, page):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
    }

    api_url = "https://www.vivino.com/api/wines/{id}/reviews?per_page=50&year={year}&page={page}"  # <-- increased the number of reviews to 9999

    data = requests.get(
        api_url.format(id=wine_id, year=year, page=page), headers=headers
    ).json()

    return data


r = requests.get(
    "https://www.vivino.com/api/explore/explore",
    params={
        "country_code": "FR",
        "country_codes[]": "pt",
        "currency_code": "EUR",
        "grape_filter": "varietal",
        "min_rating": "1",
        "order_by": "price",
        "order": "asc",
        "page": 1,
        "price_range_max": "500",
        "price_range_min": "0",
        "wine_type_ids[]": "1",
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
    },
)

results = [
    (
        t["vintage"]["wine"]["winery"]["name"],
        t["vintage"]["year"],
        t["vintage"]["wine"]["id"],
        f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
        t["vintage"]["statistics"]["ratings_average"],
        t["vintage"]["statistics"]["ratings_count"],
    )
    for t in r.json()["explore_vintage"]["matches"]
]
dataframe = pd.DataFrame(
    results,
    columns=["Winery", "Year", "Wine ID", "Wine", "Rating", "num_review"],
)

ratings = []
for _, row in dataframe.iterrows():
    page = 1
    while True:
        print(
            f'Getting info about wine {row["Wine ID"]}-{row["Year"]} Page {page}'
        )

        d = get_wine_data(row["Wine ID"], row["Year"], page)

        if not d["reviews"]:
            break

        for r in d["reviews"]:
            ratings.append(
                [
                    row["Year"],
                    row["Wine ID"],
                    r["rating"],
                    r["note"],
                    r["created_at"],
                ]
            )

        page += 1

ratings = pd.DataFrame(
    ratings, columns=["Year", "Wine ID", "User Rating", "Note", "CreatedAt"]
)

df_out = ratings.merge(dataframe)
df_out.to_csv("data.csv", index=False)

创建data.csv (约40k条评论)(来自LibreOffice的屏幕截图)：

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/69287274

复制

相似问题

问从Vivino.com中抓取数据-葡萄酒信息和评论
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问从Vivino.com中抓取数据-葡萄酒信息和评论EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问从Vivino.com中抓取数据-葡萄酒信息和评论
EN