BeautifulSoup select for Google image returns an empty list

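Stack Overflow user

Asked 2021-12-05 14:23:25

2 answers · 264 views · 0 followers · 3 votes

I want to retrieve information from Google Arts & Culture using BeautifulSoup. I have checked many Stack Overflow posts ([1], [2], [3], [4], [5]), but I still cannot retrieve the information.

I want the information for each tile (picture), that is, each li element, such as its href. However, find_all and select_one return an empty list or None.

Could you help me get the following href value from the anchor tag with class "e0WtYb HpzMff PJLMUc"?

href="/entity/claude-monet/m01xnj?categoryId=artist"

Here is what I have tried.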

import requests
from bs4 import BeautifulSoup

url = 'https://artsandculture.google.com/category/artist?tab=time&date=1850'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.find_all('li', class_='DuHQbc'))                 # []
print(soup.find_all('a', class_='PJLMUc'))                  # []
print(soup.find_all('a', class_='e0WtYb HpzMff PJLMUc'))    # []
print(soup.select_one('#tab_time > div > div:nth-child(2) > div > ul > li:nth-child(2) > a'))  # None
for elem in soup.find_all('a', class_=['e0WtYb', 'HpzMff', 'PJLMUc'], href=True):
    print(elem)  # others with class 'e0WtYb'

...
# and then something like elem['href']

https://artsandculture.google.com/category/artist?tab=time&date=1850

Selector copied from Chrome

#tab_time > div > div:nth-child(2) > div > ul > li:nth-child(2) > a

python

beautifulsoup

web-crawler

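2 answers

Stack Overflow user

Accepted answer

Answered 2021-12-05 17:51:36

Unfortunately, the problem is not that you are using BeautifulSoup incorrectly. The page you are requesting appears to be missing its content! I saved html.text to a file to inspect it.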

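Why does this happen? Because the page actually loads its content with JavaScript. When you open the site in a browser, the browser executes the JavaScript, which adds all of the artist tiles to the page. (When you first load the site, you may even notice a brief moment during which the tiles are not there.) requests, on the other hand, does not execute JavaScript. It only downloads the content of the page and stores it in a string.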
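A minimal offline sketch of this effect. The HTML string below is made up for illustration and is not the real Google Arts & Culture markup: the list is empty in the page source, and a script would fill it in a browser. BeautifulSoup only parses the text it is given, so it finds no li elements.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the <ul> is empty and a script would add
# the tiles in a browser. This is NOT the real site markup.
STATIC_HTML = """
<html><body>
  <ul id="tiles"></ul>
  <script>
    const ul = document.getElementById("tiles");
    ul.innerHTML = '<li class="tile"><a href="/entity/example">Example</a></li>';
  </script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The parser never runs the script, so the list items do not exist in the tree.
tiles = soup.find_all("li", class_="tile")
print(tiles)

# The markup is still present, but only as text inside the <script> tag.
script_text = soup.find("script").string
print("/entity/example" in script_text)
```

This is the same situation as in the question: the selectors are fine, but the elements they target are not in the downloaded HTML.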

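What can you do about it? Unfortunately, this means that scraping the site will be very difficult. In this case, I would suggest looking for another source of information or using an API provided by the site.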

3 votes

Stack Overflow user

Answered 2022-09-06 15:02:35

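To scrape Google Arts & Culture, the BeautifulSoup web scraping library alone is enough. However, we need to take into account that the page is dynamic and switch strategy from parsing HTML elements (CSS selectors and so on) to parsing the data with regular expressions.

We need regular expressions because the information we want comes from the server and is stored as inline JSON, which is then (I guess) rendered via JavaScript. First, look at the page source (CTRL + U) to check whether there are matches and, if there are, where exactly they are.

Since information for all three tabs (All, A, Time) is returned at once, we need to select the portion of the JSON that holds the "Time" tab data with a regular expression, and then search that portion for matches and extract them: for example, the author, the link to the author, and the number of artworks.

Here is an example regular expression that extracts the portion of the inline JSON containing the data for the "Time" tab: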

# https://regex101.com/r/4XAQ49/1
portion_of_script_tags = re.search(r"\[\"stella\.pr\",\"DatedAssets:.*\",\[\[\"stella\.common\.cobject\",(.*?)\[\]\]\]\;<\/script>", str(all_script_tags)).group(1)
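To show the idea on a small scale, here is a sketch that applies the same two-step approach (cut out a portion, then search inside it) to a made-up script tag. The SAMPLE string and the simplified patterns are assumptions for illustration only. The real inline data is much larger and its layout may differ.

```python
import re

# Made-up inline data that loosely imitates the structure described above.
SAMPLE = (
    '<script>data = ["stella.pr","DatedAssets:demo",'
    '[["stella.common.cobject","Claude Monet","275 items","/entity/claude-monet/m01xnj"],'
    '["stella.common.cobject","Paul Gauguin","380 items","/entity/paul-gauguin/m0h82x"]]];</script>'
)

# Step 1: cut out the portion that follows the "DatedAssets" marker.
portion = re.search(r'"DatedAssets:.*?",\[(.*?)\];</script>', SAMPLE).group(1)

# Step 2: run narrower patterns against that portion only.
names = re.findall(r'"stella\.common\.cobject","(.*?)"', portion)
links = re.findall(r'"(/entity/.*?)"', portion)
counts = re.findall(r'"(\d+) items"', portion)

for name, link, count in zip(names, links, counts):
    print(name, link, count)
```

Raw strings (the r prefix) keep the backslashes intact for the regex engine, and single-quoted literals avoid having to escape every double quote.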

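Also note that the request may be blocked if you use requests with its default user-agent, because the default user-agent in the requests library is python-requests. An additional step could be to rotate the user-agent, for example, switching between PC, mobile, and tablet, as well as between browsers such as Chrome, Firefox, Safari, Edge, and so on.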
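A minimal sketch of user-agent rotation. The USER_AGENTS list and the build_headers helper are hypothetical names, and the user-agent strings are illustrative values rather than a vetted list.

```python
import random

# Illustrative user-agent strings (desktop Chrome, desktop Firefox, mobile Safari).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1",
]


def build_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


headers = build_headers()
print(headers["User-Agent"])
```

The resulting dictionary can be passed as the headers argument of requests.get, calling build_headers again for each new request.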

Code snippet that extracts 54 authors (the code is also available in an online IDE):

from bs4 import BeautifulSoup
import requests, json, re, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
    "tab": "time",
    "date": "1850"
}

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    }

html = requests.get("https://artsandculture.google.com/category/artist", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

author_results = []

all_script_tags = soup.select("script")
# https://regex101.com/r/4XAQ49/1
portion_of_script_tags = re.search(r"\[\"stella\.pr\",\"DatedAssets:.*\",\[\[\"stella\.common\.cobject\",(.*?)\[\]\]\]\;<\/script>", str(all_script_tags)).group(1)

# https://regex101.com/r/XXAbKH/1
authors = re.findall(r"\"((?!stella\.common\.cobject)\w.*?)\",\"\d+", str(portion_of_script_tags))

# https://regex101.com/r/K4K3iB/1
author_links = [f"https://artsandculture.google.com{link}" for link in re.findall("\"(/entity.*?)\"", str(portion_of_script_tags))]

# https://regex101.com/r/x6wwVJ/1
number_of_artworks = re.findall(r"\"(\d+).*?items\"", str(portion_of_script_tags))

for author, author_link, num_artworks in zip(authors, author_links, number_of_artworks):
    author_results.append({
        "author": author,
        "author_link": author_link,
        "number_of_artworks": num_artworks
    })

print(json.dumps(author_results, indent=2, ensure_ascii=False))

Example output

[
  {
    "author": "Vincent van Gogh",
    "author_link": "https://artsandculture.google.com/entity/vincent-van-gogh/m07_m2?categoryId\\u003dartist",
    "number_of_artworks": "338"
  },
  {
    "author": "Claude Monet",
    "author_link": "https://artsandculture.google.com/entity/claude-monet/m01xnj?categoryId\\u003dartist",
    "number_of_artworks": "275"
  },
  {
    "author": "Paul Cézanne",
    "author_link": "https://artsandculture.google.com/entity/paul-cézanne/m063mx?categoryId\\u003dartist",
    "number_of_artworks": "301"
  },
  {
    "author": "Paul Gauguin",
    "author_link": "https://artsandculture.google.com/entity/paul-gauguin/m0h82x?categoryId\\u003dartist",
    "number_of_artworks": "380"
  },
  # ...
]
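The links in the example output contain a literal \u003d where = would be expected, presumably because the inline data stores the character escaped. A possible post-processing step, which is not part of the original answer, is sketched below. The decode_unicode_escapes helper is a hypothetical name.

```python
import re


def decode_unicode_escapes(text):
    # Replace each literal backslash-u sequence (4 hex digits) with its character.
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


raw_link = "https://artsandculture.google.com/entity/claude-monet/m01xnj?categoryId\\u003dartist"
clean_link = decode_unicode_escapes(raw_link)
print(clean_link)
```

Strings without such sequences, including ones with non-ASCII characters like "paul-cézanne", pass through unchanged.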

3 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/70235264
