Python: A "Stock Data Scrapy Crawler" Example

Exploring

Published 2022-09-20 14:01:05

Included in the column: 数据处理与编程实践 (Data Processing and Programming Practice)

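Background: An earlier post (see Further reading at the end) used the requests-bs4-re stack to fetch the names and trading information of all A-share stocks listed in Shanghai and Shenzhen and save them to a file. This post performs the same crawl with the Scrapy framework.

Technical approach: Scrapy

Runtime environment: Windows 10 + JupyterLab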

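1 Choosing the data sites

Selection principle: the stock information must be present statically in the HTML page, not generated by JavaScript.

How to check: open the browser's developer tools (F12), view the page source, and so on.

Mindset: do not get stuck on one particular site; look for several information sources.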
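One quick way to apply the selection principle is to check whether a value visible in the rendered page also appears in the raw HTML outside any script block. The sketch below is only an illustration and not part of the original project; the helper name appears_in_static_html and both sample pages are made up for the example.

```python
import re


def appears_in_static_html(html: str, needle: str) -> bool:
    """Return True if `needle` occurs in the HTML outside <script> blocks."""
    # Drop inline scripts so that data embedded in JavaScript does not count.
    without_scripts = re.sub(r"<script\b.*?</script>", "", html,
                             flags=re.DOTALL | re.IGNORECASE)
    return needle in without_scripts


# Hypothetical page where the price is plain HTML.
static_page = "<html><body><dd>10.25</dd></body></html>"
# Hypothetical page where the price is only injected by JavaScript.
dynamic_page = ("<html><body><dd id='p'></dd>"
                "<script>setPrice('10.25')</script></body></html>")

print(appears_in_static_html(static_page, "10.25"))   # True
print(appears_in_static_html(dynamic_page, "10.25"))  # False
```

A page that behaves like the second sample would need a different data source or a tool that executes JavaScript, which is why the article looks for sites of the first kind.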

(1) Source for the stock list:

炒股一点通 (cgedt.com): http://www.cgedt.com/stockcode/yilanbiao.asp

(2) Source for individual stock information:

股城网 (Gucheng): https://hq.gucheng.com/HSinfo.html

Individual stock pages: https://hq.gucheng.com/SH600050/

https://hq.gucheng.com/SZ002276/
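The two example URLs suggest a simple pattern: an exchange prefix (SH or SZ) followed by the six-digit stock code. A minimal sketch of that mapping follows; the function name gucheng_url is hypothetical, and the rule "codes starting with 6 are SH, everything else is SZ" is the simplification used by the spider in this article, not a complete description of exchange codes.

```python
def gucheng_url(code: str) -> str:
    """Build the hq.gucheng.com page URL for a six-digit stock code."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit code, got {code!r}")
    # Simplification taken from the article's spider: codes starting
    # with "6" are Shanghai (SH); everything else is treated as Shenzhen (SZ).
    prefix = "SH" if code.startswith("6") else "SZ"
    return f"https://hq.gucheng.com/{prefix}{code}/"


print(gucheng_url("600050"))  # https://hq.gucheng.com/SH600050/
print(gucheng_url("002276"))  # https://hq.gucheng.com/SZ002276/
```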

2 Design

Create the project and the Spider template
Write the Spider
Write the Item Pipelines

3 Implementation

(1) Create the project and Spider template (JupyterLab)

import scrapy, os
os.chdir("E:\\python123\\网络爬虫")

!scrapy startproject GuchengStocks

(2.1) Create the Spider (JupyterLab)

import scrapy, os

# Raw string, so the backslashes in the Windows path are not read as escapes
os.chdir(r"E:\python123\网络爬虫\GuchengStocks")

!scrapy genspider stocks hq.gucheng.com

(2.2) Write the Spider (edit the code in stocks.py)

# -*- coding: utf-8 -*-
# stocks.py

import re

import scrapy


class StocksSpider(scrapy.Spider):
    name = "stocks"
    # Stock list page; links of the form /stock/<six digits>/ identify stocks
    start_urls = ['http://www.cgedt.com/stockcode/yilanbiao.asp']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            match = re.search(r"/stock/(\d{6})/", href)
            if match is None:
                continue  # not a link to a stock page
            code = match.group(1)
            # Codes starting with "6" go to SH; all others are treated as SZ
            stock = ("SH" if code[0] == "6" else "SZ") + code
            url = 'https://hq.gucheng.com/' + stock
            yield scrapy.Request(url, callback=self.parse_stock)

    def parse_stock(self, response):
        infoDict = {}
        # The .stock_title block holds an <h1> and an <h2>
        stockInfo1 = response.css('.stock_title')
        name1 = stockInfo1.css('h1').extract()[0]
        name2 = stockInfo1.css('h2').extract()[0]

        # Trading fields are <dt>label</dt> / <dd>value</dd> pairs
        stockInfo2 = response.css('.stock_price.clearfix')
        keyList = stockInfo2.css('dt').extract()
        valueList = stockInfo2.css('dd').extract()
        for i in range(len(keyList)):
            # Strip the leading '>' and the trailing '</dt>'
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                # Take the text from the first digit up to '</dd>'
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except IndexError:
                # No number found, or fewer <dd> than <dt> elements
                val = '--'
            infoDict[key] = val

        # '股票名称' means "stock name": the <h1> text followed by the <h2> text
        infoDict.update(
            {'股票名称': re.findall('>.*</h1>', name1)[0][1:-5] +
             re.findall('>.*</h2>', name2)[0][1:-5]})
        yield infoDict
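To see what the regular expressions in parse_stock do, it can help to run them on a small fragment outside Scrapy. The HTML strings below are a made-up sample that imitates the dt/dd layout the spider expects; the real page markup may differ.

```python
import re

# Made-up sample imitating the <dt>/<dd> pairs inside .stock_price
# (今开 = open, 最高 = high, 成交量 = volume)
key_html = ['<dt>今开</dt>', '<dt>最高</dt>', '<dt>成交量</dt>']
value_html = ['<dd>6.72</dd>', '<dd class="red">6.80</dd>', '<dd>--</dd>']

info = {}
for k, v in zip(key_html, value_html):
    # '>.*</dt>' matches from the first '>' to the closing tag;
    # slicing [1:-5] removes the leading '>' and the trailing '</dt>'.
    key = re.findall(r'>.*</dt>', k)[0][1:-5]
    # The value pattern starts at the first digit; if there is none, use '--'.
    found = re.findall(r'\d+\.?.*</dd>', v)
    val = found[0][:-5] if found else '--'
    info[key] = val

print(info)  # {'今开': '6.72', '最高': '6.80', '成交量': '--'}
```

The third pair shows the fallback: a cell without any digit produces no match, so the value is recorded as '--'.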

(3.1) Write the Pipelines (edit the code in pipelines.py)

Define the classes that process each scraped item.

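# pipelines.py
from itemadapter import ItemAdapter


class GuchengstocksPipeline:
    # Default pipeline generated by the project template; passes items through
    def process_item(self, item, spider):
        return item


class GuchengstocksInfoPipeline:
    def open_spider(self, spider):
        # Explicit UTF-8, so the output does not depend on the platform default
        self.f = open('GuchengStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # One scraped item per line, written as the text of a dict
        line = str(ItemAdapter(item).asdict()) + '\n'
        self.f.write(line)
        return item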
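The pipeline above writes each item as the text of a Python dict. A possible variant, sketched here as an alternative and not as the article's method, writes one JSON object per line, which other tools can read more easily. The class name JsonLinesStockPipeline, its path parameter, and the sample item are all hypothetical; the class uses the same open_spider, process_item, and close_spider hooks.

```python
import json
import os
import tempfile


class JsonLinesStockPipeline:
    """Hypothetical variant: write each item as one JSON object per line."""

    def __init__(self, path='GuchengStockInfo.jsonl'):
        self.path = path

    def open_spider(self, spider):
        self.f = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese keys and names readable in the file.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item


# Exercise the hooks by hand (no Scrapy needed) with a made-up item.
out_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl')
pipeline = JsonLinesStockPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'股票名称': '示例股票', '今开': '6.72'}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding='utf-8') as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

Calling the three hooks directly, as done here, is also a convenient way to check a pipeline before wiring it into the project settings.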

(3.2) Configure the ITEM_PIPELINES option (edit the code in settings.py)

# settings.py
ITEM_PIPELINES = {
    'GuchengStocks.pipelines.GuchengstocksInfoPipeline': 300,
}
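The number attached to each pipeline in ITEM_PIPELINES sets the order: Scrapy passes items through pipelines from the lowest value to the highest, and values are conventionally chosen in the 0-1000 range. The sketch below only sorts such a dict to show the resulting order; enabling both pipeline classes, with the value 200 for GuchengstocksPipeline, is an assumption made for illustration and not a setting from the article.

```python
# Hypothetical settings with both pipeline classes enabled.
ITEM_PIPELINES = {
    'GuchengStocks.pipelines.GuchengstocksInfoPipeline': 300,
    'GuchengStocks.pipelines.GuchengstocksPipeline': 200,
}

# Lower numbers run first.
run_order = [path for path, priority in
             sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
for path in run_order:
    print(path)
```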

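(4) Run the crawler (Command Prompt window)

In the project directory (GuchengStocks), run the spider by the name set in stocks.py:

scrapy crawl stocks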

Results: the scraped records are written to GuchengStockInfo.txt, one dict per line.
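Because each line of the output file is the str() of a dict, it can be parsed back into Python objects with ast.literal_eval. The sample lines below are made up to match that format; with a real run you would read the lines from GuchengStockInfo.txt instead.

```python
import ast

# Made-up lines in the same format the pipeline writes: str(dict) + '\n'
sample_lines = [
    "{'今开': '6.72', '最高': '6.80', '股票名称': '示例股票A'}\n",
    "{'今开': '--', '最高': '--', '股票名称': '示例股票B'}\n",
]

# literal_eval only evaluates Python literals, so it is safer than eval().
records = [ast.literal_eval(line.strip())
           for line in sample_lines if line.strip()]
names = [r['股票名称'] for r in records]
print(len(records), names)
```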


Further reading:

[1] Python: A "Targeted Stock Data Crawler" Example

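This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2020-12-25. In case of infringement, contact cloudcommunity@tencent.com for removal.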

This article is shared from the WeChat official account 数据处理与编程实践.
