
Scraping all reviews of a Google Play app using Selenium and Python

Stack Overflow user
Asked on 2021-12-12 08:30:23
3 answers · 579 views · 0 followers · 0 votes

I want to scrape all the reviews of a specific app from the Google Play store. I wrote the following script:

# App Reviews Scraper
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from bs4 import BeautifulSoup

url = "https://play.google.com/store/apps/details?id=com.android.chrome&hl=en&showAllReviews=true"

# make request
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
SCROLL_PAUSE_TIME = 5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
time.sleep(SCROLL_PAUSE_TIME)

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height

# Get everything inside <html> tag including javascript
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, 'html.parser')

reviewer = []
date = []

# review text
for span in soup.find_all("span", class_="X43Kjb"):
    reviewer.append(span.text)

# review date
for span in soup.find_all("span", class_="p2TkOb"):
    date.append(span.text)

print(len(reviewer))
print(len(date))

However, it always shows only 203. There are 35,474,218 reviews. So how can I download all of them?


3 Answers

Stack Overflow user

Posted on 2021-12-12 10:10:51

wait = WebDriverWait(driver, 1)


try:
    # click "Show More" when it appears
    wait.until(EC.element_to_be_clickable((By.XPATH, "//span[text()='Show More']"))).click()
except:
    continue  # meant to run inside the scroll loop: keep scrolling if the button is absent

Just add this inside the infinite scroll to check for the "Show More" element.

Imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
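Putting the pieces together, the control flow is: scroll, compare heights, and only when the height stops growing try the "Show More" click before giving up. A minimal sketch of that loop logic with a stubbed page object (`FakePage` below is purely illustrative, standing in for the Selenium driver):

```python
class FakePage:
    """Illustrative stand-in for the browser page driven by Selenium."""
    def __init__(self):
        self.height = 100
        self.show_more_clicks_left = 2  # the page pauses twice before truly ending

    def get_height(self):
        return self.height  # mimics document.body.scrollHeight

    def click_show_more(self):
        # mimics find_element + click; raises like NoSuchElementException when absent
        if self.show_more_clicks_left == 0:
            raise RuntimeError("no Show More button")
        self.show_more_clicks_left -= 1
        self.height += 100  # the click loads another batch of reviews

def scroll_all(page):
    """Scroll until the height stops growing AND no Show More button remains."""
    last_height = page.get_height()
    while True:
        # (real code would scroll down and sleep here)
        new_height = page.get_height()
        if new_height == last_height:
            try:
                page.click_show_more()
            except RuntimeError:
                break  # no growth and no button: all reviews are loaded
        last_height = page.get_height()
    return page.get_height()
```

In the real script, `RuntimeError` corresponds to Selenium's `NoSuchElementException`, and the commented step is the `window.scrollTo` call plus `time.sleep`.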
Votes: 0

Stack Overflow user

Posted on 2022-08-30 10:41:03

A simpler way to scrape app data from the Play store:

!pip install google_play_scraper

from google_play_scraper import app

# US-market Google Play store reviews
from google_play_scraper import Sort, reviews_all

us_reviews = reviews_all(
    'com.android.chrome',    # app id — the string after "id=" in the Play store link used above
    sleep_milliseconds=0,    # defaults to 0
    lang='en',               # defaults to 'en'; can be changed to other languages
    country='us',            # defaults to 'us'
    sort=Sort.NEWEST,        # defaults to Sort.MOST_RELEVANT
)

Convert to a dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(us_reviews), columns=['review'])
df = df.join(pd.DataFrame(df.pop('review').tolist()))
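As a sanity check, `reviews_all` returns a plain list of dicts, so it can also be flattened into columns directly with `pd.DataFrame` — shown here with hypothetical sample dicts (the real entries carry more fields than these):

```python
import pandas as pd

# hypothetical sample of the dicts returned by reviews_all()
us_reviews = [
    {"userName": "A", "score": 5, "content": "Great app"},
    {"userName": "B", "score": 3, "content": "Okay"},
]

# a list of dicts flattens directly into one column per key
df = pd.DataFrame(us_reviews)
print(df.columns.tolist())  # ['userName', 'score', 'content']
print(len(df))              # 2
```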
Votes: 0

Stack Overflow user

Posted on 2022-11-02 13:38:57

Because of Google's limits, I don't think there is a way to extract all the reviews. For example, the com.collectorz.javamobile.android.books app lists 2,470 reviews, but only 879 are actually rendered after scrolling to the end of the reviews — a 64.41% decrease.

Example calculation:

(879 - 2470)/2470 = -64.41% (64.41% decrease)

In Chrome DevTools, after scrolling to the end of the reviews:

$$(".X5PpBb")
[0 … 99]
[100 … 199]
[200 … 299]
[300 … 399]
[400 … 499]
[500 … 599]
[600 … 699]
[700 … 799]
[800 … 878]
length: 879 

In the new UI a "Show More" button appears; execution may stop, get stuck, or throw an error there, which reduces the number of reviews retrieved.

To extract all available data, you need to check whether the "See all reviews" button is present. The button may be absent when the app has few or no reviews. If it is present, click it and wait for the data to load:

# if the "See all reviews" button is present
# (find_elements returns an empty list instead of raising when the button is absent)
if driver.find_elements(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button"):
    # clicking on the button
    button = driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button")
    driver.execute_script("arguments[0].click();", button)

    # waiting a few seconds for the comments to load
    time.sleep(4)

Once the data has loaded, you need to scroll the page. You can make a small change to the page-scrolling algorithm: if the `new_height` and `old_height` variables are equal, the program looks for the "Show More" button selector. If the button is present, it clicks it and moves on to the next step:

if new_height == old_height:
    try:
        show_more = driver.find_element(By.XPATH, "//span[text()='Show More']")
        driver.execute_script("arguments[0].click();", show_more)
        time.sleep(1)
    except:
        break

Full example with the code in an online IDE:

import time, lxml, re, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

URL = "https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en"

service = Service(ChromeDriverManager().install())

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--lang=en")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)

# if the "See all reviews" button is present
# (find_elements returns an empty list instead of raising when the button is absent)
if driver.find_elements(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button"):
    # clicking on the button
    button = driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button")
    driver.execute_script("arguments[0].click();", button)

    # waiting a few seconds for the comments to load
    time.sleep(4)

    old_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.fysCi').scrollHeight;
        }
        return getHeight();
    """)

    # scrolling
    while True:
        driver.execute_script("document.querySelector('.fysCi').scrollTo(0, document.querySelector('.fysCi').scrollHeight)")
        time.sleep(1)

        new_height = driver.execute_script("""
            function getHeight() {
                return document.querySelector('.fysCi').scrollHeight;
            }
            return getHeight();
        """)

        if new_height == old_height:
            try:
                # if "Show More" button present
                show_more = driver.find_element(By.XPATH, "//span[text()='Show More']")
                driver.execute_script("arguments[0].click();", show_more)
                time.sleep(1)
            except:
                break

        old_height = new_height
    
    # done scrolling
    soup = BeautifulSoup(driver.page_source, 'lxml')
    driver.quit()

    user_comments = []
    
    # extracting comments
    for index, comment in enumerate(soup.select(".RHo1pe"), start=1):
        comment_likes = comment.select_one(".AJTPZc")   
    
        user_comments.append({
            "position": index,
            "user_name": comment.select_one(".X5PpBb").text,
            "user_avatar": comment.select_one(".gSGphe img").get("srcset").replace(" 2x", ""),
            "user_comment": comment.select_one(".h3YV2d").text,
            "comment_likes": comment_likes.text.split("people")[0].strip() if comment_likes else None,
            "app_rating": re.search(r"\d+", comment.select_one(".iXRFPc").get("aria-label")).group(),
            "comment_date": comment.select_one(".bp9Aid").text
        })
    
    print(json.dumps(user_comments, indent=2, ensure_ascii=False))

If you want to extract reviews faster, you can use SerpApi's Google Play Product Reviews API. It bypasses blocks from the search engine, and you don't have to build a parser from scratch and maintain it.

Code example that paginates through all pages and extracts the reviews:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import os, json

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),                            # your serpapi api key
    "engine": "google_play_product",                            # serpapi parsing engine
    "store": "apps",                                            # app results
    "gl": "us",                                                 # country of the search
    "hl": "en",                                                 # language of the search
    "product_id": "com.collectorz.javamobile.android.books"     # app id
}

search = GoogleSearch(params)       # where data extraction happens on the backend

reviews = []

while True:
    results = search.get_dict()     # JSON -> Python dict

    for review in results["reviews"]:
        reviews.append({
            "title": review.get("title"),
            "avatar": review.get("avatar"),
            "rating": review.get("rating"),
            "likes": review.get("likes"),
            "date": review.get("date"),
            "snippet": review.get("snippet"),
            "response": review.get("response")
        })

    # pagination
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination", {}).get("next")).query)))
    else:
        break
        
print(json.dumps(reviews, indent=2, ensure_ascii=False))

There is a "Scrape all Google app reviews in Python" blog post that shows in detail how to extract all the reviews.

Disclaimer: I work for SerpApi.

Votes: 0
The original page content is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/70322053
