I want to scrape all the reviews for a specific app from the Google Play Store. I wrote the following script:
# App Reviews Scraper
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
url = "https://play.google.com/store/apps/details?id=com.android.chrome&hl=en&showAllReviews=true"
# make request
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
SCROLL_PAUSE_TIME = 5
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
time.sleep(SCROLL_PAUSE_TIME)
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
# Get everything inside the <html> tag, including JavaScript
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, 'html.parser')
reviewer = []
date = []
# review text
for span in soup.find_all("span", class_="X43Kjb"):
    reviewer.append(span.text)
# review date
for span in soup.find_all("span", class_="p2TkOb"):
    date.append(span.text)
print(len(reviewer))
print(len(date))
However, it always prints only 203, while the app has 35,474,218 reviews. So how do I download all of them?
Posted on 2021-12-12 10:10:51
wait = WebDriverWait(driver, 1)
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, "//span[text()='Show More']"))).click()
except:
    continue
Just add this inside the infinite-scroll loop to check for the Show More element.
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Posted on 2022-08-30 10:41:03
A simpler way to scrape app data from the Play Store:
!pip install google_play_scraper
from google_play_scraper import app
# US market Google Play Store reviews
from google_play_scraper import Sort, reviews_all

us_reviews = reviews_all(
    'add the app id here',  # the string after the id= value in the Play Store hyperlink you used above
    sleep_milliseconds=0,  # defaults to 0
    lang='en',  # defaults to 'en'; can change to another language as well
    country='us',  # defaults to 'us'
    sort=Sort.NEWEST,  # defaults to Sort.MOST_RELEVANT
)
Convert to a dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array(us_reviews), columns=['review'])
df = df.join(pd.DataFrame(df.pop('review').tolist()))
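A minimal sketch of that conversion, using made-up review dicts in place of the real `reviews_all()` output (the real library returns dicts with many more keys):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews_all() result: a list of dicts.
us_reviews = [
    {"userName": "Alice", "score": 5, "content": "Great app"},
    {"userName": "Bob", "score": 2, "content": "Crashes often"},
]

# Wrap each review dict in a one-column frame, then expand the dicts
# into their own columns.
df = pd.DataFrame(np.array(us_reviews), columns=["review"])
df = df.join(pd.DataFrame(df.pop("review").tolist()))

print(list(df.columns))  # → ['userName', 'score', 'content']
print(len(df))           # → 2
```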
Posted on 2022-11-02 13:38:57
Because of Google's limits, I don't think there is a way to extract all reviews. For example, the com.collectorz.javamobile.android.books app has 2,470 reviews, but only 879 actually show up after scrolling to the end of the reviews, a 64.41% decrease.
Calculation example:
(879 - 2470)/2470 = -64.41% (64.41% decrease)
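That percentage can be checked directly:

```python
total_reviews = 2470  # count shown on the app's store page
shown_reviews = 879   # count actually rendered after scrolling to the end

# Relative change from the advertised count to the rendered count.
change = (shown_reviews - total_reviews) / total_reviews * 100
print(round(change, 2))  # → -64.41
```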
In Chrome DevTools, after scrolling to the end of the reviews:
$$(".X5PpBb")
[0 … 99]
[100 … 199]
[200 … 299]
[300 … 399]
[400 … 499]
[500 … 599]
[600 … 699]
[700 … 799]
[800 … 878]
length: 879
In the new UI a Show More button appears; execution may stop, get stuck, or throw an error there, which reduces the number of reviews collected.
To extract all available data, you need to check whether the "See all reviews" button exists. The button may be absent if the app has few or no reviews. If it exists, you need to click it and wait for the data to load:
# if "See all reviews" button present (find_elements returns an empty
# list instead of raising when the button is absent)
if driver.find_elements(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button"):
    # clicking on the button
    button = driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button")
    driver.execute_script("arguments[0].click();", button)
    # waiting a few sec to load comments
    time.sleep(4)
Once the data has loaded, you need to scroll the page. You can make a small change to the page-scrolling algorithm: if the variables new_height and old_height are equal, the program looks for the Show More button selector. If the button exists, the program clicks it and continues to the next iteration:
if new_height == old_height:
    try:
        show_more = driver.find_element(By.XPATH, "//span[text()='Show More']")
        driver.execute_script("arguments[0].click();", show_more)
        time.sleep(1)
    except:
        break
Full example code (also available in the online IDE):
import time, lxml, re, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
URL = "https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en"
service = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--lang=en")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)
# if "See all reviews" button present (find_elements returns an empty
# list instead of raising when the button is absent)
if driver.find_elements(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button"):
    # clicking on the button
    button = driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button")
    driver.execute_script("arguments[0].click();", button)
    # waiting a few sec to load comments
    time.sleep(4)
old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('.fysCi').scrollHeight;
    }
    return getHeight();
""")
# scrolling
while True:
    driver.execute_script("document.querySelector('.fysCi').scrollTo(0, document.querySelector('.fysCi').scrollHeight)")
    time.sleep(1)
    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.fysCi').scrollHeight;
        }
        return getHeight();
    """)
    if new_height == old_height:
        try:
            # if "Show More" button present
            show_more = driver.find_element(By.XPATH, "//span[text()='Show More']")
            driver.execute_script("arguments[0].click();", show_more)
            time.sleep(1)
        except:
            break
    old_height = new_height
# done scrolling
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
user_comments = []
# extracting comments
for index, comment in enumerate(soup.select(".RHo1pe"), start=1):
    comment_likes = comment.select_one(".AJTPZc")
    user_comments.append({
        "position": index,
        "user_name": comment.select_one(".X5PpBb").text,
        "user_avatar": comment.select_one(".gSGphe img").get("srcset").replace(" 2x", ""),
        "user_comment": comment.select_one(".h3YV2d").text,
        "comment_likes": comment_likes.text.split("people")[0].strip() if comment_likes else None,
        "app_rating": re.search(r"\d+", comment.select_one(".iXRFPc").get("aria-label")).group(),
        "comment_date": comment.select_one(".bp9Aid").text
    })
print(json.dumps(user_comments, indent=2, ensure_ascii=False))
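The `comment_likes` and `app_rating` fields in the loop above come from plain string parsing; a standalone sketch of that logic on made-up sample strings (the real Play Store markup and wording may differ):

```python
import re

# Hypothetical examples of the text the .AJTPZc and .iXRFPc selectors
# would return for a review.
likes_text = "26 people found this review helpful"
rating_label = "Rated 4 stars out of five stars"

# Take everything before "people" for the like count.
comment_likes = likes_text.split("people")[0].strip()
# Take the first run of digits in the aria-label for the star rating.
app_rating = re.search(r"\d+", rating_label).group()

print(comment_likes)  # → 26
print(app_rating)     # → 4
```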
If you want to extract reviews faster, you can use the Google Product Reviews API from SerpApi. It bypasses blocks from search engines, and you don't have to build a parser from scratch and maintain it.
Code example that paginates through all pages and extracts reviews:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import os, json
params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),  # your serpapi api key
    "engine": "google_play_product",  # serpapi parsing engine
    "store": "apps",                  # app results
    "gl": "us",                       # country of the search
    "hl": "en",                       # language of the search
    "product_id": "com.collectorz.javamobile.android.books"  # app id
}
search = GoogleSearch(params) # where data extraction happens on the backend
reviews = []
while True:
    results = search.get_dict()  # JSON -> Python dict
    for review in results["reviews"]:
        reviews.append({
            "title": review.get("title"),
            "avatar": review.get("avatar"),
            "rating": review.get("rating"),
            "likes": review.get("likes"),
            "date": review.get("date"),
            "snippet": review.get("snippet"),
            "response": review.get("response")
        })
    # pagination
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination", {}).get("next")).query)))
    else:
        break
print(json.dumps(reviews, indent=2, ensure_ascii=False))
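The pagination step above just merges the query string of the `next` URL back into the request params. That merge can be sketched with the standard library alone (the URL and token below are made up for illustration):

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical "next page" URL of the kind serpapi_pagination returns.
next_url = "https://serpapi.com/search?engine=google_play_product&next_page_token=abc123"

params = {
    "engine": "google_play_product",
    "product_id": "com.collectorz.javamobile.android.books",
}

# Split off the query string, parse it into key/value pairs, and fold
# them into the existing params so the next request continues the scan.
params.update(dict(parse_qsl(urlsplit(next_url).query)))

print(params["next_page_token"])  # → abc123
```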
There is a Scrape All Google App Reviews in Python blog post that shows in detail how to extract all reviews.
Disclaimer: I work for SerpApi.
https://stackoverflow.com/questions/70322053