Q: Scrapy - Crawled (200) (referer: None)

Stack Overflow user

Asked 2020-05-10 16:13:38

1 answer · 109 views · 0 followers · 0 votes

I am trying to learn how to use Scrapy and Python, but I am not an expert at all...

After scraping this page (so.news.cn) I end up with an empty output file, and I don't understand why...

Here is my code:

import scrapy

class XinhuaSpider(scrapy.Spider):
    name = 'xinhua'
    allowed_domains = ['xinhuanet.com']
    start_urls = ['http://so.news.cn/?keyWordAll=&keyWordOne=%E6%96%B0%E5%86%A0+%E8%82%BA%E7%82%8E+%E6%AD%A6%E6%B1%89+%E7%97%85%E6%AF%92&keyWordIg=&searchFields=1&sortField=0&url=&senSearch=1&lang=cn#search/0/%E6%96%B0%E5%86%A0/1/']

    def parse(self, response):
        #titles = response.css('#newsCon > div.newsList > div.news > h2 > a::text').extract()
        #date = response.css('#newsCon > div.newsList > div.news> div > p.newstime > span::text').extract()
        titles = response.xpath("/html/body/div[@id='search-result']/div[@class='resultCnt']/div[@id='resultList']/div[@class='newsListCnt secondlist']/div[@id='newsCon']/div[@class='newsList']/div[@class='news']/h2/a/text()").extract()
        date = response.xpath("/html/body/div[@id='search-result']/div[@class='resultCnt']/div[@id='resultList']/div[@class='newsListCnt secondlist']/div[@id='newsCon']/div[@class='newsList']/div[@class='news']/div[@class='easynews']/p[@class='newstime']/span/text()").extract()
        for item in zip(titles,date):
            scraped_info ={
                "title" : item[0],
                "date"  : item[1],
            }
            yield scraped_info

        nextPg = response.xpath("/html/body/div[@id='search-result']/div[@class='resultCnt']/div[@id='pagination']/a[@class='next']/@href").extract()
        if nextPg is not None:
            print(nextPg)

This is the message in the console:

2020-05-11 00:09:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://so.news.cn/?keyWordAll=&keyWordOne=%E6%96%B0%E5%86%A0+%E8%82%BA%E7%82%8E+%E6%AD%A6%E6%B1%89+%E7%97%85%E6%AF%92&keyWordIg=&searchFields=1&sortField=0&url=&senSearch=1&lang=cn#search/0/%E6%96%B0%E5%86%A0/1/> (referer: None)
[]
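
A side note on the log line above: everything after the `#` in the start URL is a fragment, and a fragment is never sent to the server as part of an HTTP request. The sketch below uses only the standard library to show which part of such a URL the server actually receives. The URL in it is a shortened, illustrative version of the one in the spider, and the helper name `split_fragment` is made up for this example.

```python
from urllib.parse import urldefrag

# Shortened, illustrative version of the spider's start URL (assumption:
# most query parameters are dropped here purely for readability).
START_URL = "http://so.news.cn/?keyWordAll=&lang=cn#search/0/%E6%96%B0%E5%86%A0/1/"


def split_fragment(url):
    """Return (url_without_fragment, fragment) for the given URL."""
    result = urldefrag(url)
    return result.url, result.fragment


requested, fragment = split_fragment(START_URL)
print("Requested from the server:", requested)
print("Kept client-side only:   ", fragment)
```

In a browser, the page's JavaScript reads that fragment and fetches the results separately, which is consistent with the selectors finding nothing in the HTML that Scrapy downloaded.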

python

scrapy

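1 Answer

Stack Overflow user

Answered 2020-05-10 16:34:48

You should always check the page's source code in your browser (Ctrl+U). What you see rendered in the browser may be loaded afterwards by JavaScript XHR calls, in which case it is not in the HTML that Scrapy downloads and your XPath expressions match nothing. Here is the code that works for me (I found the correct start URL using the Chrome developer console):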

import scrapy
import json
import re

class XinhuaSpider(scrapy.Spider):
    name = 'xinhua'
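    # Disabled, presumably because 'xinhuanet.com' does not cover so.news.cn,
    # so the follow-up page requests would be filtered as offsite.
    # allowed_domains = ['xinhuanet.com']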
    start_urls = ['http://so.news.cn/getNews?keyWordAll=&keyWordOne=%25E6%2596%25B0%25E5%2586%25A0%2B%25E8%2582%25BA%25E7%2582%258E%2B%25E6%25AD%25A6%25E6%25B1%2589%2B%25E7%2597%2585%25E6%25AF%2592&keyWordIg=&searchFields=1&sortField=0&url=&senSearch=1&lang=cn&keyword=%E6%96%B0%E5%86%A0&curPage=1']

    def parse(self, response):
        # The endpoint returns JSON, so decode the body instead of using selectors.
        data = json.loads(response.body)
        for item in data["content"]["results"]:
            scraped_info = {
                "title": item['title'],
                "date": item['pubtime'],
            }
            yield scraped_info

        # Follow pagination by incrementing the curPage query parameter.
        current_page = data['content']['curPage']
        total_pages = data['content']['pageCount']
        if current_page < total_pages:
            next_page = re.sub(r'curPage=\d+', f"curPage={current_page + 1}", response.url)
            yield scrapy.Request(
                url=next_page,
                callback=self.parse,
            )
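
If you want to check the parsing and pagination logic without hitting the site, one option is to pull it out into small pure functions. The following is only a sketch under assumptions: the helper names (`with_page`, `extract_items`, `next_page_number`) and the sample payload are hypothetical, and the payload merely mirrors the field names used in the spider above (`content`, `results`, `title`, `pubtime`, `curPage`, `pageCount`). It also swaps the regex substitution for `urllib.parse`, which is an alternative and not what the spider above does.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return url with its curPage query parameter set to page."""
    parts = urlsplit(url)
    # keep_blank_values preserves parameters such as "keyWordAll=".
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "curPage"]
    query.append(("curPage", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def extract_items(payload):
    """Map the JSON payload to the title/date dicts the spider yields."""
    results = payload.get("content", {}).get("results") or []
    return [{"title": r.get("title"), "date": r.get("pubtime")} for r in results]


def next_page_number(payload):
    """Return the next page number, or None when on the last page."""
    content = payload.get("content", {})
    current, total = content.get("curPage"), content.get("pageCount")
    if isinstance(current, int) and isinstance(total, int) and current < total:
        return current + 1
    return None


# Hypothetical payload, shaped only after the fields the spider reads.
SAMPLE = json.loads("""
{"content": {"curPage": 1, "pageCount": 2,
             "results": [{"title": "Example title", "pubtime": "2020-05-10 12:00:00"}]}}
""")

items = extract_items(SAMPLE)
next_url = with_page("http://so.news.cn/getNews?keyword=abc&curPage=1",
                     next_page_number(SAMPLE))
print(items)
print(next_url)
```

Inside the spider, `parse` could then yield each dict from `extract_items(data)` and, when `next_page_number(data)` is not `None`, a `scrapy.Request` for `with_page(response.url, ...)`.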