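I'm trying to use find_all(), but I seem to be having trouble locating the tags that hold specific information.
I'd like to build a wrapper so I can extract data from the app store, such as the title, publisher, etc. (public HTML information).
The code isn't right, I know. The closest thing to an identifier for the div that I could find is "c4".
Any insight would help.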
# Imports
import requests
from bs4 import BeautifulSoup
# Data Defining
url = "https://play.google.com/store/search?q=weather%20app"
# Getting HTML
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
soup.get_text()
results = soup.find_all(id="c4")
I'm expecting output that lists the different weather apps and their information:
Weather App 1
Develop Company 1
Google Weather App
Develop Company 2
Bing Weather App
Bing Developers
Posted on 2022-04-03 18:02:49
Here is the output I get from the URL:
from bs4 import BeautifulSoup
import requests

url = 'https://play.google.com/store/search?q=weather%20app'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

cards = soup.find_all("div", class_="vU6FJ p63iDd")
for card in cards:
    app_name = card.find("div", class_="WsMG1c nnK0zc").text
    company = card.find("div", class_="KoLSrc").text
    print("Name: " + app_name)
    print("Company: " + company)
Output:
Name: Weather app
Company: Accurate Weather Forecast & Weather Radar Map
Name: AccuWeather: Weather Radar
Company: AccuWeather
Name: Weather Forecast - Accurate Local Weather & Widget
Company: Weather Forecast & Widget & Radar
Name: 1Weather Forecasts & Radar
Company: OneLouder Apps
Name: MyRadar Weather Radar
Company: ACME AtronOmatic LLC
Name: Weather data & microclimate : Weather Underground
Company: Weather Underground
Name: Weather & Widget - Weawow
Company: weawow weather app
Name: Weather forecast
Company: smart-pro android apps
Name: The Secret World of Weather: How to Read Signs in Every Cloud, Breeze, Hill, Street, Plant, Animal, and Dewdrop
Company: Tristan Gooley
Name: The Weather Machine: A Journey Inside the Forecast
Company: Andrew Blum
Name: The Mobile Mind Shift: Engineer Your Business to Win in the Mobile Moment
Company: Julie Ask
Name: Together: The Healing Power of Human Connection in a Sometimes Lonely World
Company: Vivek H. Murthy
Name: The Meadow
Company: James Galvin
Name: The Ancient Egyptian Culture Revealed, 2nd edition
Company: Moustafa Gadalla
Name: The Ancient Egyptian Culture Revealed, 2nd edition
Company: Moustafa Gadalla
Name: Chaos Theory
Company: Introbooks Team
Name: Survival Training: Killer Tips for Toughness and Secret Smart Survival Skills
Company: Wesley Jones
Name: Kiasunomics 2: Economic Insights for Everyday Life
Company: Ang Swee Hoon
Name: Summary of We Are The Weather by Jonathan Safran Foer
Company: QuickRead
Name: Learn Swift by Building Applications: Explore Swift programming through iOS app development
Company: Emil Atanasov
Name: Weather Hazard Warning Application in Car-to-X Communication: Concepts, Implementations, and Evaluations
Company: Attila Jaeger
Name: Mobile App Development with Ionic, Revised Edition: Cross-Platform Apps with Ionic, Angular, and Cordova
Company: Chris Griffith
Name: Good Application Makes a Good Roof Better: A Simplified Guide: Installing Laminated Asphalt Shingles for Maximum Life & Weather Protection
Company: ARMA Asphalt Roofing Manufacturers Association
Name: The Secret World of Weather: How to Read Signs in Every Cloud, Breeze, Hill, Street, Plant, Animal, and Dewdrop
Company: Tristan Gooley
Name: The Weather Machine: A Journey Inside the Forecast
Company: Andrew Blum
Name: Space Physics and Aeronomy, Space Weather Effects and Applications
Company: Book 5
Name: How to Build Android Apps with Kotlin: A hands-on guide to developing, testing, and publishing your first apps with Android
Company: Alex Forrester
Name: Android 6 for Programmers: An App-Driven Approach, Edition 3
Company: Paul J. Deitel
Posted on 2022-04-07 12:07:03
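Make sure you send a user-agent so the request looks like it comes from a "real" user. Without a user-agent in the request headers, you can sometimes receive different HTML with different elements and selectors, or some kind of error.
Keep your user-agent up to date when possible, because a site might block requests whose user-agent is old, for example one that reports Chrome version 70.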
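As a rough illustration of the point above about sending a browser-like user-agent, the sketch below builds (but does not send) a search request that carries one. It uses only the standard library's urllib so that it runs offline; `build_search_request` and the `USER_AGENT` constant are names made up for this example, and the user-agent string is just an example value.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Browser-like user-agent string (an example value; keep it reasonably current).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36"
)


def build_search_request(query: str, country: str = "US", filter_by: str = "apps") -> Request:
    """Build a Google Play search request without sending it (hypothetical helper)."""
    params = {"q": query, "gl": country, "c": filter_by}
    url = "https://play.google.com/store/search?" + urlencode(params)
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_search_request("weather app")
print(request.full_url)
# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(request.get_header("User-agent"))
```

The same `params` and `headers` dictionaries can be passed to `requests.get(url, params=..., headers=...)` when using the requests library instead.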
Also, have a look at the SelectorGadget Chrome extension, which lets you grab CSS selectors visually by clicking on the desired element in your browser.
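Once you have a CSS selector, BeautifulSoup's `select()` and `select_one()` accept it directly. The sketch below runs against a small inline HTML snippet rather than the live page; the markup and the class names (`card`, `title`, `dev`) are invented for illustration and are not Google Play's real class names. The `text_or_none` helper is also hypothetical: it guards against `select_one()` returning `None` when an element is missing, which would otherwise raise an `AttributeError` on `.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class names are invented for this example only.
SAMPLE_HTML = """
<div class="card"><span class="title">Weather App 1</span><span class="dev">Develop Company 1</span></div>
<div class="card"><span class="title">Google Weather App</span></div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One dictionary per card; a missing element becomes None instead of an error.
apps = [
    {"title": text_or_none(card, ".title"), "company": text_or_none(card, ".dev")}
    for card in soup.select("div.card")
]
print(apps)
```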
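Update 06/06/2022.
Google recently changed its UI. Google Play Search now returns a limited number of apps, i.e. there is no pagination.
Code and full example in the online IDE (code updated after the Google change):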
from bs4 import BeautifulSoup
import requests, json, lxml, re


def bs4_scrape_google_play_store_search_apps(
    query: str, filter_by: str = "apps", country: str = "US"
):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": query,      # search query
        "gl": country,   # country of the search. Different countries display different apps.
        "c": filter_by   # filter to display list of apps. Other filters: apps, books, movies
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
    }

    html = requests.get("https://play.google.com/store/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    apps_data = []

    for app in soup.select("[jscontroller=tKHFxf]"):
        title = app.select_one(".DdYX5").text
        company = app.select_one(".wMUdtb").text
        app_icon = app.select_one(".j2FCNc img")["srcset"]

        # fall back to the other thumbnail element when the first one is missing
        try:
            thumbnail = app.select_one(".Shbxxd img")["srcset"]
        except (TypeError, KeyError):
            thumbnail = app.select_one(".Vc0mnc img")["src"]

        app_link = f'https://play.google.com{app.select_one(".Si6A0c.Gy4nib")["href"]}'
        app_id = app.select_one("a")["href"].split("id=")[1]

        try:
            # https://regex101.com/r/SZLPRp/1
            rating = re.search(r"\d{1}\.\d{1}", app.select_one(".ubGTjb div")["aria-label"]).group()
        except (TypeError, KeyError, AttributeError):
            rating = None

        apps_data.append({
            "title": title,
            "app_link": app_link,
            "company": company,
            "rating": float(rating) if rating else rating,  # float, or None when no rating was found
            "app_id": app_id,
            "thumbnail": thumbnail,
            "icon": app_icon
        })

    print(json.dumps(apps_data, indent=2, ensure_ascii=False))


bs4_scrape_google_play_store_search_apps(query="maps", filter_by="apps", country="US")
Part of the output:
[
{
"title": "Google Maps",
"app_link": "https://play.google.com/store/apps/details?id=com.google.android.apps.maps",
"company": "Google LLC",
"rating": 3.9,
"app_id": "com.google.android.apps.maps",
"thumbnail": "https://play-lh.googleusercontent.com/FQx43QTaAqeOtoTLylK3WIs7ySKuGS8AurXNA1Kj34m6w6CjavF4Oj3s5DB6xZZ7DS63=w832-h470-rw 2x",
"icon": "https://play-lh.googleusercontent.com/Kf8WTct65hFJxBUDm5E-EpYsiDoLQiGGbnuyP6HBNax43YShXti9THPon1YKB6zPYpA=s128-rw 2x"
}, ... other results
{
"title": "GPS, Maps, Voice Navigation & Directions",
"app_link": "https://play.google.com/store/apps/details?id=com.maps.voice.navigation.traffic.gps.location.route.driving.directions",
"company": "AppStar Studios",
"rating": 4.0,
"app_id": "com.maps.voice.navigation.traffic.gps.location.route.driving.directions",
"thumbnail": "https://i.ytimg.com/vi/4E2NyVZlOjc/hqdefault.jpg",
"icon": "https://play-lh.googleusercontent.com/NrK0b-e6cpj4yYkDuNZJHO9KUAl8pSj9TGi4Xw4GbPZ6UVsnAlLBH2AZuEMpb24Xig=s128-rw 2x"
}
]
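The string handling in the code above (app id from the link, rating from the `aria-label`, image URL from `srcset`) can be tried in isolation. The sketch below is one possible way to do it with the standard library only; the three helper names are made up for this example, the `aria-label` wording and the `example.com` URL are assumed sample values, and `parse_qs` is swapped in as an alternative to `split("id=")`.

```python
import re
from urllib.parse import parse_qs, urlparse


def parse_app_id(href: str):
    """Read the `id` query parameter from an app link, or None if it is absent."""
    values = parse_qs(urlparse(href).query).get("id")
    return values[0] if values else None


def parse_rating(aria_label: str):
    """Return the first 'digit.digit' number in the label as a float, or None."""
    match = re.search(r"\d\.\d", aria_label or "")
    return float(match.group()) if match else None


def first_srcset_url(srcset: str) -> str:
    """Return the URL of the first srcset candidate without its ' 2x' descriptor.

    Assumes the URL itself contains no commas or spaces.
    """
    parts = srcset.split(",")[0].split()
    return parts[0] if parts else ""


print(parse_app_id("/store/apps/details?id=com.google.android.apps.maps"))
print(parse_rating("Rated 3.9 stars out of five stars"))  # assumed label wording
print(first_srcset_url("https://example.com/icon=s128-rw 2x"))  # placeholder URL
```

Returning `None` for a missing id or rating keeps the scraper from failing on a single incomplete card.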
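Another solution is to use the Google Play Store API from SerpApi. It's a paid API with a free plan.
The difference is that there's no need to create a parser from scratch, maintain it, figure out how to extract the data, or bypass blocks from Google or other search engines.
Code to integrate: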
from serpapi import GoogleSearch
import json

params = {
    "api_key": "API KEY",       # your serpapi api key
    "engine": "google_play",    # search engine
    "hl": "en",                 # language
    "store": "apps",            # apps search
    "gl": "us",                 # country to search from. Different countries display different apps.
    "q": "weather"              # search query
}

search = GoogleSearch(params)   # where data extraction happens
results = search.get_dict()     # JSON -> Python dictionary

apps_data = []

for apps in results["organic_results"]:
    for app in apps["items"]:
        apps_data.append({
            "title": app.get("title"),
            "link": app.get("link"),
            "description": app.get("description"),
            "product_id": app.get("product_id"),
            "rating": app.get("rating"),
            "thumbnail": app.get("thumbnail"),
        })

print(json.dumps(apps_data, indent=2, ensure_ascii=False))
Part of the output (the response contains other data as well, which you can see in the playground):
[
{
"title": "Weather app",
"link": "https://play.google.com/store/apps/details?id=com.weather.forecast.weatherchannel",
"description": "The weather channel, tiempo weather forecast, weather radar & weather map",
"product_id": "com.weather.forecast.weatherchannel",
"rating": 4.7,
"thumbnail": "https://play-lh.googleusercontent.com/GdXjVGXQ90eVNpb1VoXWGT3pff2M9oe3yDdYGIsde7W9h3s2S6FDLfo1uO-gljBZ1QXO=s128-rw"
},
{
"title": "The Weather Channel - Radar",
"link": "https://play.google.com/store/apps/details?id=com.weather.Weather",
"description": "Weather Forecast & Snow Radar: local rain tracker, weather maps & alerts",
"product_id": "com.weather.Weather",
"rating": 4.6,
"thumbnail": "https://play-lh.googleusercontent.com/RV3DftXlA7WUV7w-BpE8zM0X7Y4RQd2vBvZVv6A01DEGb_eXFRjLmUhSqdbqrEl9klI=s128-rw"
},
{
"title": "AccuWeather: Weather Radar",
"link": "https://play.google.com/store/apps/details?id=com.accuweather.android",
"description": "Your local weather forecast, storm tracker, radar maps & live weather news",
"product_id": "com.accuweather.android",
"rating": 4.0,
"thumbnail": "https://play-lh.googleusercontent.com/EgDT3XrIaJbhZjINCWsiqjzonzqve7LgAbim8kHXWgg6fZnQebqIWjE6UcGahJ6yugU=s128-rw"
},
{
"title": "Weather by WeatherBug",
"link": "https://play.google.com/store/apps/details?id=com.aws.android",
"description": "The Most Accurate Weather Forecast. Alerts, Radar, Maps & News from WeatherBug",
"product_id": "com.aws.android",
"rating": 4.7,
"thumbnail": "https://play-lh.googleusercontent.com/_rZCkobaGZzXN3iquPr4u2KOe7C-ljnrSkBfw6sVL1kpUfq3sBl5MoRJEisBSnxaD-M=s128-rw"
}, ... other results
]
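The nested loop over `organic_results` and `items` can be tried without an API key. The sketch below flattens a small hand-written dictionary that mimics the nesting the code above iterates over; `sample_results` is a hypothetical, heavily trimmed fragment (a real response contains many more fields), and `flatten_items` is a helper name made up for this example.

```python
import json

# Hypothetical fragment: only the nesting (organic_results -> items) is modeled.
sample_results = {
    "organic_results": [
        {
            "items": [
                {"title": "Weather app", "product_id": "com.weather.forecast.weatherchannel", "rating": 4.7},
                {"title": "AccuWeather: Weather Radar", "product_id": "com.accuweather.android", "rating": 4.0},
            ]
        },
        {
            "items": [
                {"title": "Weather by WeatherBug", "product_id": "com.aws.android"},  # no rating key
            ]
        },
    ]
}


def flatten_items(results: dict, fields=("title", "product_id", "rating")) -> list:
    """Collect the requested fields from every item in every result section."""
    flat = []
    for section in results.get("organic_results", []):
        for item in section.get("items", []):
            # dict.get() yields None for fields an item does not have
            flat.append({field: item.get(field) for field in fields})
    return flat


apps_data = flatten_items(sample_results)
print(json.dumps(apps_data, indent=2, ensure_ascii=False))
```

Using `.get()` with a default at each level means an empty or partial response yields an empty list rather than a `KeyError`.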
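I also have a dedicated blog post, Scrape Google Play Search Apps in Python, with a step-by-step explanation that would be too much for this answer.
Disclaimer: I work for SerpApi.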
https://stackoverflow.com/questions/71727849