Scrapy Framework + Elasticsearch

Stanley Sun
Published 2019-09-23 15:20:31

Prerequisites

1. The Scrapy framework is installed

2. Elasticsearch is installed

Create a project named scrapyes

scrapy startproject scrapyes

Directory structure

.
|____scrapy.cfg
|____scrapyes
| |______init__.py
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py

Install ScrapyElasticSearch

pip install ScrapyElasticSearch

Configure settings.py

...

ITEM_PIPELINES = {
  'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 300,
}

ELASTICSEARCH_SERVERS = ['192.168.4.215']
ELASTICSEARCH_PORT = 9200 # If port 80 leave blank
ELASTICSEARCH_USERNAME = ''
ELASTICSEARCH_PASSWORD = ''
ELASTICSEARCH_INDEX = 'scrapy.course'
ELASTICSEARCH_TYPE = 'course'
ELASTICSEARCH_UNIQ_KEY = 'url'

...

For an explanation of these settings, see https://github.com/knockrentals/scrapy-elasticsearch

Write a spider for online courses

import scrapy

class ESCourseSpider(scrapy.Spider):
    name = 'es_course'

    def start_requests(self):
        # range(1, 30) covers course IDs 1 through 29
        # (xrange exists only in Python 2; range works in Python 3)
        for i in range(1, 30):
            url = 'http://demo.edusoho.com/course/' + str(i)
            yield scrapy.Request(url=url, callback=self.parse)


    def parse(self, response):
        yield {
            'title': response.css('span.course-detail-heading::text').extract_first(),
            'price': response.css('b.pirce-num::text').extract_first(),
            'url' : response.url,
        }
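
The URL loop in start_requests can be checked on its own, without running Scrapy. The following is a minimal sketch for illustration only; the helper name course_urls and its parameters are hypothetical and not part of the original project.

```python
def course_urls(first=1, last=29, base='http://demo.edusoho.com/course/'):
    """Build the list of course page URLs for IDs first..last (inclusive)."""
    # range() excludes its upper bound, so add 1 to include `last`
    return [base + str(i) for i in range(first, last + 1)]


urls = course_urls()
print(len(urls), urls[0], urls[-1])
```

With the defaults this mirrors range(1, 30) in the spider, which stops at course 29 rather than 30.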

Run the spider

scrapy crawl es_course -o es_course.json

The scraped content is written to a newly created file, es_course.json:

[
{"url": "http://demo.edusoho.com/course/1", "price": "免费", "title": "\n               课程功能体验\n                        "},
{"url": "http://demo.edusoho.com/course/20", "price": "0.01", "title": "\n               官方主题\n                        "},
{"url": "http://demo.edusoho.com/course/24", "price": "999.00", "title": "\n               会员专区\n                        "},
{"url": "http://demo.edusoho.com/course/22", "price": "免费", "title": "\n               第三方主题\n                        "},
{"url": "http://demo.edusoho.com/course/27", "price": "0.01", "title": "\n               优惠码\n                        "}
]
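
The title values in this output keep the newlines and padding from the page markup. If cleaner values are wanted, one possible approach is to strip string fields before yielding the item. This is a sketch under that assumption; the helper name clean_item is hypothetical and does not appear in the original spider.

```python
def clean_item(item):
    """Return a copy of item with surrounding whitespace stripped from string values."""
    # Non-string values (for example None when a selector matches nothing) pass through unchanged
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in item.items()}


raw = {
    "url": "http://demo.edusoho.com/course/1",
    "price": "免费",
    "title": "\n               课程功能体验\n                        ",
}
cleaned = clean_item(raw)
print(repr(cleaned["title"]))
```

In the spider, the dict built in parse could be passed through such a helper before it is yielded.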

Check the data in Elasticsearch with the following query:

GET scrapy.course*/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 50
}
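
The same query can also be sent from Python over the Elasticsearch HTTP API. The sketch below only builds the request with the standard library; the function name build_search_request is hypothetical, the host and port are taken from the settings shown earlier, and POST is used because urllib does not send a body with GET (the _search endpoint accepts both).

```python
import json
import urllib.request


def build_search_request(host='192.168.4.215', port=9200,
                         index='scrapy.course*', size=50):
    """Build a match_all search request for the given index pattern."""
    body = {"query": {"match_all": {}}, "from": 0, "size": size}
    url = 'http://{}:{}/{}/_search'.format(host, port, index)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


req = build_search_request()
print(req.get_method(), req.full_url)

# To run it against a reachable cluster:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["hits"]["hits"])
```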

Result

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "scrapy.course",
        "_type": "course",
        "_id": "6306093149d91c35eabc1c59f28d68355cc4de9d",
        "_score": 1,
        "_source": {
          "url": "http://demo.edusoho.com/course/1",
          "price": "免费",
          "title": "\n               课程功能体验\n                        "
        }
      },
      {
        "_index": "scrapy.course",
        "_type": "course",
        "_id": "6a090cfe8f9dbf3d21248d64d9907eab4b31bc4d",
        "_score": 1,
        "_source": {
          "url": "http://demo.edusoho.com/course/24",
          "price": "999.00",
          "title": "\n               会员专区\n                        "
        }
      },

...

This shows that the data has been stored in Elasticsearch.
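
The _id values in the result are 40 hexadecimal characters long, which is the length of a SHA-1 digest. One plausible reading is that the pipeline derives each document ID by hashing the field named in ELASTICSEARCH_UNIQ_KEY (here url), so re-crawling the same page would overwrite the existing document instead of adding a duplicate. The sketch below illustrates that idea only; it is an assumption, the function name doc_id is hypothetical, and it has not been checked against the IDs shown above.

```python
import hashlib


def doc_id(unique_value):
    """Derive a stable 40-character hex ID from a unique field value (assumed scheme)."""
    return hashlib.sha1(unique_value.encode('utf-8')).hexdigest()


first = doc_id('http://demo.edusoho.com/course/1')
again = doc_id('http://demo.edusoho.com/course/1')
other = doc_id('http://demo.edusoho.com/course/24')
print(len(first), first == again, first == other)
```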
