Paginating Google search results with Python

Stack Overflow user
Asked 2021-10-06 05:25:44
Answers: 2 | Views: 1K | Followers: 0 | Votes: 2

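Hi, I have code that scrapes Google search results and returns the link, title and description of each page. The problem is that it only scrapes the first page. I want to add pagination so that it covers multiple pages.

Can someone help me figure out how to add pagination? I tried several other examples that support pagination, but they only return the URL. I would appreciate any help with this.

Code: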

import requests
import urllib.parse  # quote_plus lives in the urllib.parse submodule
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession


def get_source(url):
    """Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)

def get_results(query):
    
    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.co.uk/search?q=" + query)
    
    return response


def parse_results(response):
    
    css_identifier_result = ".tF2Cxc"
    css_identifier_title = "h3"
    css_identifier_link = ".yuRUbf a"
    css_identifier_text = ".IsZvec"
    
    results = response.html.find(css_identifier_result)

    output = []
    
    for result in results:

        item = {
            'title': result.find(css_identifier_title, first=True).text,
            'link': result.find(css_identifier_link, first=True).attrs['href'],
            'text': result.find(css_identifier_text, first=True).text
        }
        
        output.append(item)
        
    return output

def google_search(query):
    response = get_results(query)
    return parse_results(response)


query = input("Enter your value: ")
results = google_search(query)
print(results)

Stack Overflow user

Accepted answer

Answered 2021-10-06 14:01:45

Here is a working example. You can widen or narrow the page range as needed. Sorry for the late answer; I was busy.

Code:

import scrapy
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest

class ElonSpider(scrapy.Spider):
    name = 'elon'

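    def start_requests(self):
        # One URL per value of x in range(1, 3). Note that x is appended to the
        # `dpr` parameter, while `start` stays at 0 in every URL.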
        urls = ['https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=' +
                str(x)+'&productfilter=&sort=null' for x in range(1, 3)]
        for url in urls:

            yield SeleniumRequest(
                url=url,
                wait_time=6,
                callback=self.parse)

    def parse(self, response):

        boxs = response.xpath('//*[@class="tF2Cxc"]')
        for box in boxs:

            yield {
                'Title': box.xpath('.//*[@class="LC20lb DKV0Md"]/text()').get()
                }

    def spider_closed(self):
        self.driver.close()
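
The page range in the spider comes from `range(1, 3)`. Below is a minimal sketch of how a page range could be turned into result-page URLs, assuming Google addresses result pages through the `start` query parameter in steps of 10 (the URL above already carries `start=0`). The function name `build_page_urls` and the `per_page` default are illustrative and not part of the answer:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.co.uk/search"  # same host as in the question


def build_page_urls(query, pages, per_page=10):
    """Return one search URL per result page (hypothetical helper).

    Assumes `start` is a zero-based result offset, so page 1 is
    start=0, page 2 is start=10, and so on.
    """
    urls = []
    for page in range(pages):
        params = {"q": query, "start": page * per_page}
        # urlencode escapes the query the same way quote_plus does
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in build_page_urls('"Elon Musk"', pages=3):
        print(url)
```

In the spider, `start_requests` could iterate over the list returned by `build_page_urls` instead of the hard-coded URL list.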

settings.py file:

You have to change or update the uncommented parts of your settings.py file like this.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'scrapy_sr.middlewares.ScrapySrSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'scrapy_sr.middlewares.ScrapySrDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'scrapy_sr.pipelines.ScrapySrPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


# Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

# Selenium
from shutil import which  # needed for which('chromedriver') below

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# Chrome expects '--headless'; Firefox uses '-headless'
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
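
`which('chromedriver')` is `shutil.which`, which looks the driver up on `PATH` and returns `None` when it is not found. A small sketch of a check that could run before starting the crawl follows; `resolve_driver` is a hypothetical helper and not part of scrapy-selenium:

```python
from shutil import which


def resolve_driver(name="chromedriver"):
    """Return the full path of a WebDriver binary, or raise if it is missing."""
    path = which(name)
    if path is None:
        raise FileNotFoundError(
            f"{name} was not found on PATH; install it or add its folder to PATH"
        )
    return path
```

Calling `resolve_driver()` in place of the bare `which('chromedriver')` would make a missing driver fail early with a readable message.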

Output:

{'Title': 'Elon Musk - Wikipedia'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=1&productfilter=&sort=null>
{'Title': 'Elon Musk (@elonmusk) | Twitter'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=1&productfilter=&sort=null>
{'Title': 'Elon Musk - Forbes'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=1&productfilter=&sort=null>
{'Title': 'Elon Musk | Tesla'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=1&productfilter=&sort=null>
{'Title': '@elonmusk • Instagram photos and videos'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=1&productfilter=&sort=null>
{'Title': 'Elon Musk | Biography, SpaceX, Tesla, & Facts | Britannica'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=2&productfilter=&sort=null>
{'Title': 'Elon Musk - Wikipedia'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=1&productfilter=&sort=null>
{'Title': 'Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=1&productfilter=&sort=null>
{'Title': 'Elon Musk - CNBC'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=1&productfilter=&sort=null>
{'Title': 'Elon Musk - Tesla, Age & Family - Biography'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=2&productfilter=&sort=null>
{'Title': 'Elon Musk (@elonmusk) | Twitter'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=2&productfilter=&sort=null>
{'Title': 'Elon Musk - Forbes'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=2&productfilter=&sort=null>
{'Title': 'Elon Musk | Tesla'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=2&productfilter=&sort=null>
{'Title': '@elonmusk • Instagram photos and videos'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=2&productfilter=&sort=null>
{'Title': 'Elon Musk | Biography, SpaceX, Tesla, & Facts | Britannica'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=2&productfilter=&sort=null>
{'Title': 'Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=2&productfilter=&sort=null>
{'Title': 'Elon Musk - CNBC'}
2021-10-06 19:45:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.google.co.uk/search?q=%22Elon+Musk%22&ei=dEldYYS5Cpea4-EPnZGByAU&start=0&sa=N&ved=2ahUKEwiEw8OQm7XzAhUXzTgGHZ1IAFk4ChDy0wN6BAgBEDk&biw=1366&bih=625&dpr=2&productfilter=&sort=null>
{'Title': 'Elon Musk - Tesla, Age & Family - Biography'}
2021-10-06 19:45:45 [scrapy.core.engine] INFO: Closing spider (finished)
2021-10-06 19:45:45 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:50161/session/aa8a20e9cebf8c1f4e8d47187031d540 {}
2021-10-06 19:45:45 [urllib3.connectionpool] DEBUG: http://127.0.0.1:50161 "DELETE /session/aa8a20e9cebf8c1f4e8d47187031d540 HTTP/1.1" 200 14
2021-10-06 19:45:45 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-10-06 19:45:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 964822,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 5.076303,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 10, 6, 13, 45, 45, 768405),
 'item_scraped_count': 18,
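
In the log above, titles such as 'Elon Musk - Wikipedia' are scraped once from each of the two requested URLs. Below is a minimal sketch of collecting several pages into one list while dropping repeated links. It assumes a callable `fetch_page(query, start)` that returns a list of dictionaries with `title`, `link` and `text` keys, like `parse_results` in the question. The names `collect_pages` and `fetch_page`, and the step of 10 results per page, are assumptions for illustration:

```python
def collect_pages(query, pages, fetch_page, per_page=10):
    """Gather results from several pages, keeping the first copy of each link."""
    seen = set()
    merged = []
    for page in range(pages):
        items = fetch_page(query, page * per_page)
        if not items:
            break  # an empty page is treated as the end of the results
        for item in items:
            if item["link"] not in seen:
                seen.add(item["link"])
                merged.append(item)
    return merged


if __name__ == "__main__":
    # Stand-in fetcher with canned data so the sketch runs without network access.
    def fake_fetch(query, start):
        canned = {
            0: [{"title": "A", "link": "https://a.example", "text": "first"}],
            10: [
                {"title": "A", "link": "https://a.example", "text": "repeat"},
                {"title": "B", "link": "https://b.example", "text": "second"},
            ],
        }
        return canned.get(start, [])

    for row in collect_pages("demo", pages=5, fetch_page=fake_fetch):
        print(row["title"], row["link"])
```

With the question's code, `fetch_page` could wrap `get_source` and `parse_results`, passing the offset along as a `start` query parameter.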
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/69460239
