I want to scrape all the reviews for a specific app from the Google Play Store. I wrote the following script:
# App Reviews Scraper
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
url = "https://play.google.com/store/apps/details?id=com.android.chrome&hl=en&showAllReviews=true"
# make request
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
SCROLL_PAUSE_TIME = 5
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
time.sleep(SCROLL_PAUSE_TIME)
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
# Get everything inside the <html> tag, including JavaScript
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, 'html.parser')
reviewer = []
date = []
# review text
for span in soup.find_all("span", class_="X43Kjb"):
    reviewer.append(span.text)
# review date
for span in soup.find_all("span", class_="p2TkOb"):
    date.append(span.text)
print(len(reviewer))
print(len(date))
However, it always prints only 203, while the app has 35,474,218 reviews. So how do I download all of them?
Posted on 2021-12-12 10:10:51
wait = WebDriverWait(driver, 1)
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, "//span[text()='Show More']"))).click()
except:
    continue
Just add this inside the infinite-scroll loop to check for the Show More element.
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Posted on 2022-08-30 10:41:03
A simpler way to scrape app data from the Play Store:
!pip install google_play_scraper
from google_play_scraper import app
# US market Google Play Store reviews
from google_play_scraper import Sort, reviews_all

us_reviews = reviews_all(
    'add the app id here',  # the string after the id= value in the Play Store hyperlink you used above
    sleep_milliseconds=0,  # defaults to 0
    lang='en',  # defaults to 'en'; can change to another language as well
    country='us',  # defaults to 'us'
    sort=Sort.NEWEST,  # defaults to Sort.MOST_RELEVANT
)
Convert to a dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array(us_reviews), columns=['review'])
df = df.join(pd.DataFrame(df.pop('review').tolist()))
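A minimal sketch of that conversion, using made-up review dicts in place of the real `reviews_all()` output (the real library returns dicts with many more keys):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews_all() result: a list of dicts.
us_reviews = [
    {"userName": "Alice", "score": 5, "content": "Great app"},
    {"userName": "Bob", "score": 2, "content": "Crashes often"},
]

# Wrap each review dict in a one-column frame, then expand the dicts
# into their own columns.
df = pd.DataFrame(np.array(us_reviews), columns=["review"])
df = df.join(pd.DataFrame(df.pop("review").tolist()))

print(list(df.columns))  # → ['userName', 'score', 'content']
print(len(df))           # → 2
```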
Posted on 2022-11-02 13:38:57
Because of Google's limits, I don't think there is a way to extract all reviews. For example, the com.collectorz.javamobile.android.books app has 2,470 reviews, but only 879 actually show up after scrolling to the end of the reviews, a 64.41% decrease.
Calculation example:
(879 - 2470)/2470 = -64.41% (64.41% decrease)
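That percentage can be checked directly:

```python
total_reviews = 2470  # count shown on the app's store page
shown_reviews = 879   # count actually rendered after scrolling to the end

# Relative change from the advertised count to the rendered count.
change = (shown_reviews - total_reviews) / total_reviews * 100
print(round(change, 2))  # → -64.41
```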
In Chrome DevTools, after scrolling to the end of the reviews:
$$(".X5PpBb")
[0 … 99]
[100 … 199]
[200 … 299]
[300 … 399]
[400 … 499]
[500 … 599]
[600 … 699]
[700 … 799]
[800 … 878]
length: 879
In the new UI a Show More button appears; execution may stop, get stuck, or throw an error there, which reduces the number of reviews collected.
To extract all available data, you need to check whether the "See all reviews" button exists. The button may be absent if the app has few or no reviews. If it exists, you need to click it and wait for the data to load:
# if "See all reviews" button present (find_elements returns an empty
# list instead of raising when the button is absent)
if driver.find_elements(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button"):
    # clicking on the button
    button = driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button")
    driver.execute_script("arguments[0].click();", button)
    # waiting a few sec to load comments
    time.sleep(4)
Once the data has loaded, you need to scroll the page. You can make a small change to the page-scrolling algorithm: if the variables new_height and old_height are equal, the program looks for the Show More button selector. If the button exists, the program clicks it and continues to the next iteration:
if new_height == old_height:
    try:
        show_more = driver.find_element(By.XPATH, "//span[text()='Show More']")
        driver.execute_script("arguments[0].click();", show_more)
        time.sleep(1)
    except:
        break
Full example code (also available in the online IDE):
import time, lxml, re, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
URL = "https://play.google.com/store/apps/details?id=com.collectorz.javamobile.android.books&hl=en"
service = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--lang=en")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(service=service, options=options)
driver.get(URL)
# if "See all reviews" button present (find_elements returns an empty
# list instead of raising when the button is absent)
if driver.find_elements(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button"):
    # clicking on the button
    button = driver.find_element(By.CSS_SELECTOR, ".Jwxk6d .u4ICaf button")
    driver.execute_script("arguments[0].click();", button)
    # waiting a few sec to load comments
    time.sleep(4)
old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('.fysCi').scrollHeight;
    }
    return getHeight();
""")
# scrolling
while True:
    driver.execute_script("document.querySelector('.fysCi').scrollTo(0, document.querySelector('.fysCi').scrollHeight)")
    time.sleep(1)
    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.fysCi').scrollHeight;
        }
        return getHeight();
    """)
    if new_height == old_height:
        try:
            # if "Show More" button present
            show_more = driver.find_element(By.XPATH, "//span[text()='Show More']")
            driver.execute_script("arguments[0].click();", show_more)
            time.sleep(1)
        except:
            break
    old_height = new_height
# done scrolling
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
user_comments = []
# extracting comments
for index, comment in enumerate(soup.select(".RHo1pe"), start=1):
    comment_likes = comment.select_one(".AJTPZc")
    user_comments.append({
        "position": index,
        "user_name": comment.select_one(".X5PpBb").text,
        "user_avatar": comment.select_one(".gSGphe img").get("srcset").replace(" 2x", ""),
        "user_comment": comment.select_one(".h3YV2d").text,
        "comment_likes": comment_likes.text.split("people")[0].strip() if comment_likes else None,
        "app_rating": re.search(r"\d+", comment.select_one(".iXRFPc").get("aria-label")).group(),
        "comment_date": comment.select_one(".bp9Aid").text
    })
print(json.dumps(user_comments, indent=2, ensure_ascii=False))
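The `comment_likes` and `app_rating` fields in the loop above come from plain string parsing; a standalone sketch of that logic on made-up sample strings (the real Play Store markup and wording may differ):

```python
import re

# Hypothetical examples of the text the .AJTPZc and .iXRFPc selectors
# would return for a review.
likes_text = "26 people found this review helpful"
rating_label = "Rated 4 stars out of five stars"

# Take everything before "people" for the like count.
comment_likes = likes_text.split("people")[0].strip()
# Take the first run of digits in the aria-label for the star rating.
app_rating = re.search(r"\d+", rating_label).group()

print(comment_likes)  # → 26
print(app_rating)     # → 4
```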
If you want to extract reviews faster, you can use the Google Product Reviews API from SerpApi. It bypasses blocks from search engines, and you don't have to build a parser from scratch and maintain it.
Code example that paginates through all pages and extracts reviews:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import os, json
params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),  # your serpapi api key
    "engine": "google_play_product",  # serpapi parsing engine
    "store": "apps",                  # app results
    "gl": "us",                       # country of the search
    "hl": "en",                       # language of the search
    "product_id": "com.collectorz.javamobile.android.books"  # app id
}
search = GoogleSearch(params) # where data extraction happens on the backend
reviews = []
while True:
    results = search.get_dict()  # JSON -> Python dict
    for review in results["reviews"]:
        reviews.append({
            "title": review.get("title"),
            "avatar": review.get("avatar"),
            "rating": review.get("rating"),
            "likes": review.get("likes"),
            "date": review.get("date"),
            "snippet": review.get("snippet"),
            "response": review.get("response")
        })
    # pagination
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination", {}).get("next")).query)))
    else:
        break
print(json.dumps(reviews, indent=2, ensure_ascii=False))
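The pagination step above just merges the query string of the `next` URL back into the request params. That merge can be sketched with the standard library alone (the URL and token below are made up for illustration):

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical "next page" URL of the kind serpapi_pagination returns.
next_url = "https://serpapi.com/search?engine=google_play_product&next_page_token=abc123"

params = {
    "engine": "google_play_product",
    "product_id": "com.collectorz.javamobile.android.books",
}

# Split off the query string, parse it into key/value pairs, and fold
# them into the existing params so the next request continues the scan.
params.update(dict(parse_qsl(urlsplit(next_url).query)))

print(params["next_page_token"])  # → abc123
```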
There is a Scrape All Google App Reviews in Python blog post that shows in detail how to extract all reviews.
Disclaimer: I work for SerpApi.
https://stackoverflow.com/questions/70322053