$ pip install scrapy
$ pip install pymysql
The Spider class describes how to crawl a given site: start_urls lists the links to fetch, and parse() defines what data to extract from them. When a Spider starts running, it first sends requests for the URLs in start_urls and then handles the returned responses in the callback.
The Item class holds the structured data; think of it as a data model class.
Scrapy's Selector class is built on the lxml library and handles HTML/XML parsing. A Selector instance created from a response object lets you extract node data through its xpath() method.
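To tie these three pieces together, here is a minimal sketch (the TitleItem / TitleSpider names and the XPath are illustrative only, not part of the project built below): the spider requests its start_urls, parse() receives each response, and the response's built-in Selector pulls out node text that is packed into an Item.

# Minimal illustrative sketch, not the project's real code.
import scrapy


class TitleItem(scrapy.Item):
    # Each Field() declares one column of the scraped record.
    title = scrapy.Field()


class TitleSpider(scrapy.Spider):
    name = 'title_spider'
    start_urls = ['http://www.allitebooks.com/']  # where crawling starts

    def parse(self, response):
        # The response exposes the Selector API, so xpath()/css() work on it directly.
        for text in response.xpath('//h2/a/text()').extract():
            item = TitleItem()
            item['title'] = text
            yield item

Now create the actual project: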
$ scrapy startproject book_sacrpy
This creates a project named book_sacrpy.
$ cd book_sacrpy/
$ scrapy genspider book_spiser allitebooks.com
├── book_sacrpy
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── book_spiser.py
└── scrapy.cfg
Tip: in PyCharm you can do all of this in one step.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    isbn = scrapy.Field()
    price = scrapy.Field()
Notes:
This is the book_spiser.py file under the spiders folder; the full code follows (the CSS/XPath selector analysis is omitted).
# -*- coding: utf-8 -*-
import scrapy

from book_sacrpy.items import BookItem


class BookSpiserSpider(scrapy.Spider):
    name = 'book_spiser'
    allowed_domains = ['allitebooks.com', 'amazon.com']
    start_urls = ['http://allitebooks.com/security/']

    def parse(self, response):
        # Read the total page count from the "Last Page" link, then request every listing page.
        num_pages = int(response.xpath('//a[contains(@title, "Last Page →")]/text()').extract_first())
        base_url = "http://www.allitebooks.com/security/page/{0}/"
        for page in range(1, num_pages + 1):  # +1 so the last listing page is included
            yield scrapy.Request(base_url.format(page), dont_filter=True, callback=self.pare_page)

    def pare_page(self, response):
        # Follow the detail link of every book on the listing page.
        for ever in response.css('.format-standard'):
            book_url = ever.css('.entry-thumbnail a::attr(href)').extract_first("")
            yield scrapy.Request(book_url, callback=self.pare_book_info)

    def pare_book_info(self, response):
        # Extract title and ISBN, then look the book up on Amazon to fetch its price.
        title = response.css('.single-title').xpath('text()').extract_first()
        isbn = response.xpath('//dd[2]/text()').extract_first('').replace(' ', '')
        items = BookItem()
        items['title'] = title
        items['isbn'] = isbn
        amazon_price_url = 'https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=' + isbn
        yield scrapy.Request(amazon_price_url, callback=self.pare_book_price, meta={'items': items})

    def pare_book_price(self, response):
        # The partially filled item travels through meta; add the price and emit it.
        items = response.meta['items']
        items['price'] = response.xpath('//span/text()').re(r'\$[0-9]+\.[0-9]{2}?')[0]
        yield items
Notes:
Crawl and write the results to a CSV file:
$ scrapy crawl book_spiser -o books.csv
We won't use middleware for now. Set up the database table (column names and so on) yourself ahead of time.
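As a starting point, a table like the following would match the pipeline below. This is only a sketch: it assumes the same MySQL credentials the pipeline uses (localhost, root/root, database book), and the column sizes are arbitrary choices.

# One-off setup sketch: create the `book` table the pipeline writes into.
# Credentials mirror the pipeline below; adjust them and the column sizes to taste.
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="root",
                       database="book", charset="utf8")
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS book (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            isbn VARCHAR(32),
            price VARCHAR(16)
        )
    """)
conn.commit()
conn.close()

With the table in place, write the pipeline itself in pipelines.py: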
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class BookIntodbPipeline(object):

    def __init__(self):
        # Connect to the local `book` database prepared earlier.
        self.conn = pymysql.connect(host="localhost", user="root", password="root",
                                    database="book", charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized SQL avoids breaking on quotes inside titles.
        insert_sql = "insert into book(title, isbn, price) values (%s, %s, %s)"
        self.cursor.execute(insert_sql, (item['title'], item['isbn'], item['price']))
        self.conn.commit()
        return item
ITEM_PIPELINES = {
    'book_sacrpy.pipelines.BookIntodbPipeline': 300,
}
In settings.py, uncomment the ITEM_PIPELINES block shown above and register the pipeline we just wrote. The larger the number, the later that pipeline runs, and you can list several pipelines here.
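For instance, if a second pipeline were added later (DropMissingIsbnPipeline below is purely hypothetical), the setting could look like this, with the lower number running first:

# Sketch only: DropMissingIsbnPipeline does not exist in this project.
ITEM_PIPELINES = {
    'book_sacrpy.pipelines.DropMissingIsbnPipeline': 200,  # runs first
    'book_sacrpy.pipelines.BookIntodbPipeline': 300,       # runs second
}

With the pipeline registered, run the spider again: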
$ scrapy crawl book_spiser
Write a run.py file with the following code:
# coding:utf8
import os
import sys

from scrapy.cmdline import execute

# Make sure the project root is on sys.path, then launch the spider.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "book_spiser"])
From then on, you only need to run python run.py.