前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >安装和使用Scrapy

安装和使用Scrapy

作者头像
用户8442333
修改2021-05-21 11:17:20
4610
修改2021-05-21 11:17:20
举报
文章被收录于专栏:python知识

可以先创建虚拟环境并在虚拟环境下使用pip安装scrapy。

代码语言:javascript
复制
$ 

项目的目录结构如下图所示。

代码语言:javascript
复制
(venv) $ tree
.
|____ scrapy.cfg
|____ douban
| |____ spiders
| | |____ __init__.py
| | |____ __pycache__
| |____ __init__.py
| |____ __pycache__
| |____ middlewares.py
| |____ settings.py
| |____ items.py
| |____ pipelines.py

说明:Windows系统的命令行提示符下有tree命令,但是Linux和MacOS的终端是没有tree命令的,可以用下面给出的命令来定义tree命令,其实是对find命令进行了定制并别名为tree。 alias tree="find . -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'" Linux系统也可以通过yum或其他的包管理工具来安装tree。 yum install tree

根据刚才描述的数据处理流程,基本上需要我们做的有以下几件事情:

  1. 在items.py文件中定义字段,这些字段用来保存数据,方便后续的操作。 # -*- coding: utf-8 -*- # Define here the models for your scraped items # # See documentation in: # https://doc.scrapy.org/en/latest/topics/items.html import scrapy class DoubanItem(scrapy.Item): name = scrapy.Field() year = scrapy.Field() score = scrapy.Field() director = scrapy.Field() classification = scrapy.Field() actor = scrapy.Field()
  2. 在spiders文件夹中编写自己的爬虫。 (venv) $ scrapy genspider movie movie.douban.com --template=crawl # -*- coding: utf-8 -*- import scrapy from scrapy.selector import Selector from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule from douban.items import DoubanItem class MovieSpider(CrawlSpider): name = 'movie' allowed_domains = ['movie.douban.com'] start_urls = ['https://movie.douban.com/top250'] rules = ( Rule(LinkExtractor(allow=(r'https://movie.douban.com/top250\?start=\d+.*'))), Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/\d+')), callback='parse_item'), ) def parse_item(self, response): sel = Selector(response) item = DoubanItem() item['name']=sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract() item['year']=sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)') item['score']=sel.xpath('//*[@id="interest_sectl"]/div/p[1]/strong/text()').extract() item['director']=sel.xpath('//*[@id="info"]/span[1]/a/text()').extract() item['classification']= sel.xpath('//span[@property="v:genre"]/text()').extract() item['actor']= sel.xpath('//*[@id="info"]/span[3]/a[1]/text()').extract() return item 说明:上面我们通过Scrapy提供的爬虫模板创建了Spider,其中的rules中的LinkExtractor对象会自动完成对新的链接的解析,该对象中有一个名为extract_link的回调方法。Scrapy支持用XPath语法和CSS选择器进行数据解析,对应的方法分别是xpath和css,上面我们使用了XPath语法对页面进行解析,如果不熟悉XPath语法可以看看后面的补充说明。 到这里,我们已经可以通过下面的命令让爬虫运转起来。 (venv)$ scrapy crawl movie 可以在控制台看到爬取到的数据,如果想将这些数据保存到文件中,可以通过-o参数来指定文件名,Scrapy支持我们将爬取到的数据导出成JSON、CSV、XML、pickle、marshal等格式。 (venv)$ scrapy crawl moive -o result.json
  3. 在pipelines.py中完成对数据进行持久化的操作。 # -*- coding: utf-8 -*- # Define your item pipelines here # # Don't forget to add your pipeline to the ITEM_PIPELINES setting # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html import pymongo from scrapy.exceptions import DropItem from scrapy.conf import settings from scrapy import log class DoubanPipeline(object): def __init__(self): connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT']) db = connection[settings['MONGODB_DB']] self.collection = db[settings['MONGODB_COLLECTION']] def process_item(self, item, spider): #Remove invalid data valid = True for data in item: if not data: valid = False raise DropItem("Missing %s of blogpost from %s" %(data, item['url'])) if valid: #Insert data into database new_moive=[{ "name":item['name'][0], "year":item['year'][0], "score":item['score'], "director":item['director'], "classification":item['classification'], "actor":item['actor'] }] self.collection.insert(new_moive) log.msg("Item wrote to MongoDB database %s/%s" % (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']), level=log.DEBUG, spider=spider) return item 利用Pipeline我们可以完成以下操作:
    • 清理HTML数据,验证爬取的数据。
    • 丢弃重复的不必要的内容。
    • 将爬取的结果进行持久化操作。
  4. 修改settings.py文件对项目进行配置。 # -*- coding: utf-8 -*- # Scrapy settings for douban project # # For simplicity, this file contains only settings considered important or # commonly used. You can find more settings consulting the documentation: # # https://doc.scrapy.org/en/latest/topics/settings.html # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html # https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'douban' SPIDER_MODULES = ['douban.spiders'] NEWSPIDER_MODULE = 'douban.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5' # Obey robots.txt rules ROBOTSTXT_OBEY = True # Configure maximum concurrent requests performed by Scrapy (default: 16) # CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0) # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs DOWNLOAD_DELAY = 3 RANDOMIZE_DOWNLOAD_DELAY = True # The download delay setting will honor only one of: # CONCURRENT_REQUESTS_PER_DOMAIN = 16 # CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default) COOKIES_ENABLED = True MONGODB_SERVER = '120.77.222.217' MONGODB_PORT = 27017 MONGODB_DB = 'douban' MONGODB_COLLECTION = 'movie' # Disable Telnet Console (enabled by default) # TELNETCONSOLE_ENABLED = False # Override the default request headers: # DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # 'Accept-Language': 'en', # } # Enable or disable spider middlewares # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html # SPIDER_MIDDLEWARES = { # 'douban.middlewares.DoubanSpiderMiddleware': 543, # } # Enable or disable downloader middlewares # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html # DOWNLOADER_MIDDLEWARES = { # 'douban.middlewares.DoubanDownloaderMiddleware': 543, # } # Enable or disable extensions # See https://doc.scrapy.org/en/latest/topics/extensions.html # EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None, # } # Configure item pipelines # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html ITEM_PIPELINES = { 'douban.pipelines.DoubanPipeline': 400, } LOG_LEVEL = 'DEBUG' # Enable and configure the AutoThrottle extension (disabled by default) # See https://doc.scrapy.org/en/latest/topics/autothrottle.html #AUTOTHROTTLE_ENABLED = True # The initial download delay #AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies #AUTOTHROTTLE_MAX_DELAY = 60 # The average number of requests Scrapy should be sending in parallel to # each remote server #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: #AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default) # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings HTTPCACHE_ENABLED = True HTTPCACHE_EXPIRATION_SECS = 0 HTTPCACHE_DIR = 'httpcache' HTTPCACHE_IGNORE_HTTP_CODES = [] HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

补充说明

本文系转载,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文系转载前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 补充说明
相关产品与服务
数据库
云数据库为企业提供了完善的关系型数据库、非关系型数据库、分析型数据库和数据库生态工具。您可以通过产品选择和组合搭建,轻松实现高可靠、高可用性、高性能等数据库需求。云数据库服务也可大幅减少您的运维工作量,更专注于业务发展,让企业一站式享受数据上云及分布式架构的技术红利!
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档