Scrapy provides two kinds of spiders: the Spider class and the CrawlSpider class. This example uses the CrawlSpider class to implement a whole-site crawl.
CrawlSpider is a subclass of Spider. Spider is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines a set of rules (Rule) that provide a convenient mechanism for following links: links are extracted from the crawled pages and crawling continues from them.
Create a CrawlSpider from the template:
scrapy genspider -t crawl <spider name> www.xxxx.com
LinkExtractors: the purpose of a Link Extractor is to extract links; extraction is performed by calling extract_links(). It provides filters so that, for example, only links matching a regular expression are extracted. The filters are configured through the following constructor parameters:
allow (a regular expression, or a list of them): only URLs matching this expression (or list) are extracted. If not given (or empty), all links are matched.
deny (a regular expression, or a list of them): URLs matching this expression (or list) are excluded (not extracted). It takes precedence over allow. If not given (or None), no links are excluded.
allow_domains (str or list): a single domain or a list of domains from which links may be extracted.
deny_domains (str or list): a single domain or a list of domains from which links will not be extracted.
deny_extensions (list): a list of file extensions to ignore when extracting links. If not given, it defaults to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractor module.
restrict_xpaths (str or list): an XPath (or list of XPaths) defining the regions inside the response from which links should be extracted. If given, only the text selected by those XPaths is scanned for links. See the example below.
tags (str or list): a tag or list of tags to consider when extracting links. Defaults to ('a', 'area').
attrs (list): a list of attributes to look for when extracting links (only for the tags given in the tags parameter). Defaults to ('href',).
canonicalize (boolean): canonicalize each extracted URL (using scrapy.utils.url.canonicalize_url). Defaults to True.
unique (boolean): whether duplicate filtering should be applied to the extracted links.
process_value (callable): see the process_value parameter of the BaseSgmlLinkExtractor constructor.
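A minimal sketch of using a LinkExtractor directly, assuming illustrative placeholder values for the regular expressions, the domain, and the XPath:
from scrapy.linkextractors import LinkExtractor

# Only extract article links found inside the news region of the page (illustrative values)
news_extractor = LinkExtractor(
    allow=(r'/n\d+\.shtml',),                    # URLs must match this regex
    deny=(r'/login',),                           # never extract login links
    allow_domains=('sohu.com',),                 # stay on this domain
    restrict_xpaths=('//div[@class="news"]',),   # only scan this part of the page
)

def extract_news_links(response):
    # response is a scrapy.http.Response; each Link has .url and .text
    return [link.url for link in news_extractor.extract_links(response)]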
Rules: the rules attribute contains one or more Rule objects; each Rule defines a particular behaviour for crawling the site. If multiple rules match the same link, the first one, in the order in which they are defined in this attribute, is used.
callback: the value of this parameter is the callback invoked for each link extracted by link_extractor; the callback receives a response as its first argument. Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so overriding parse will make the crawl spider fail.
follow: a boolean specifying whether links extracted from responses matched by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links: specifies which function of the spider will be called with the list of links extracted by link_extractor; it is mainly used for filtering.
process_request: specifies which function of the spider will be called for every request extracted by this rule (used to filter requests). See the sketch below.
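As an illustration, a rules definition that exercises these parameters might look like the following sketch (the URL patterns and the filter_links helper are hypothetical):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # No callback: these pages are only followed for more links (follow defaults to True)
        Rule(LinkExtractor(allow=(r'/category/\w+',))),
        # With a callback: parse article pages, but do not follow links found on them
        Rule(LinkExtractor(allow=(r'/article/\d+\.html',)),
             callback='parse_item',            # never use 'parse' here
             follow=False,
             process_links='filter_links'),    # optional link-filtering hook
    )

    def filter_links(self, links):
        # Drop any links we are not interested in (hypothetical filter)
        return [link for link in links if 'advert' not in link.url]

    def parse_item(self, response):
        self.logger.info('Parsing %s', response.url)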
2. Create a Scrapy project
# scrapy startproject <project name>
scrapy startproject demo4
3. Enter the project directory and generate a spider file from a template
# scrapy genspider -l    # list the available templates
# scrapy genspider -t <template name> <spider file name> <allowed domain>
scrapy genspider -t crawl test sohu.com
4. Set up an IP pool or user-agent pool (middlewares.py)
# -*- coding: utf-8 -*-
# Import the random module
import random
# Import the module related to the IP (proxy) pool
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
# Import the module related to user agents
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


# IP pool middleware
class HTTPPROXY(HttpProxyMiddleware):
    # Initialization; note that the default must be ip=''
    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        item = random.choice(IPPOOL)
        try:
            print("Current IP: " + item["ipaddr"])
            request.meta["proxy"] = "http://" + item["ipaddr"]
        except Exception as e:
            print(e)
            pass


# IP pool
IPPOOL = [
    {"ipaddr": "182.117.102.10:8118"},
    {"ipaddr": "121.31.102.215:8123"},
    {"ipaddr": "122.94.128.49:8118"}
]


# User-agent middleware
class USERAGENT(UserAgentMiddleware):
    # Initialization; note that the default must be user_agent=''
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        item = random.choice(UPPOOL)
        try:
            print("Current User-Agent: " + item)
            request.headers.setdefault('User-Agent', item)
        except Exception as e:
            print(e)
            pass


# User-agent pool
UPPOOL = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393"
]
5. Configure settings.py
COOKIES_ENABLED = False

DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 123,
    # 'demo4.middlewares.HTTPPROXY': 125,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
    'demo4.middlewares.USERAGENT': 1
}

ITEM_PIPELINES = {
    'demo4.pipelines.Demo4Pipeline': 300,
}
6. Define the data to be collected (items.py)
# -*- coding: utf-8 -*-
import scrapy
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

class Demo4Item(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
7. Write the spider (test.py)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from demo4.items import Demo4Item

class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['sohu.com']
    start_urls = ['http://www.sohu.com/']

    rules = (
        Rule(LinkExtractor(allow=('http://news.sohu.com',), allow_domains=('sohu.com',)),
             callback='parse_item', follow=False),
        # Rule(LinkExtractor(allow=('.*?/n.*?shtml',), allow_domains=('sohu.com',)), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        i = Demo4Item()
        i['name'] = response.xpath('//div[@class="news"]/h1/a/text()').extract()
        i['link'] = response.xpath('//div[@class="news"]/h1/a/@href').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
8. Write the pipeline (pipelines.py)
# -*- coding: utf-8 -*-
import pymysql
import json
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Demo4Pipeline(object):
    def __init__(self):
        # Database connection
        self.conn = pymysql.connect(host='localhost', user='root', password='123456', database='chapter17', charset='utf8')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Skip empty values
        for j in range(0, len(item["name"])):
            nam = item["name"][j]
            lin = item["link"][j]
            print(type(nam))
            print(type(lin))
            # Note: use parameterized SQL
            sql = "insert into site(name,link) values(%s,%s)"
            self.cur.execute(sql, (nam, lin))
            self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
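The pipeline writes to a site table in a chapter17 database, neither of which Scrapy creates for you. A one-off sketch for creating them with pymysql (the column names come from the insert statement above; the column types are assumptions):
import pymysql

# Create the database and table the pipeline inserts into (assumed schema)
conn = pymysql.connect(host='localhost', user='root', password='123456', charset='utf8')
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS chapter17 DEFAULT CHARACTER SET utf8")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chapter17.site (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255),
        link VARCHAR(255)
    )
""")
conn.commit()
cur.close()
conn.close()
With the table in place, start the crawl from the project root:
scrapy crawl test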
Case study: crawling Lagou.com (拉勾网)
spider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from LaGouSpider.items import LagouJobItemLoader, LagouspiderItem
import datetime
from LaGouSpider.utils.common import get_md5


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']

    headers = {
        "HOST": "www.lagou.com",
        "Referer": "https://www.lagou.com",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }

    custom_settings = {
        "COOKIES_ENABLED": False,
        "DOWNLOAD_DELAY": 1,
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Connection': 'keep-alive',
            'Cookie': 'JSESSIONID=ABAAABAAAFCAAEGBC99154D1A744BD8AD12BA0DEE80F320; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=0; _ga=GA1.2.1111395267.1516570248; _gid=GA1.2.1409769975.1516570248; user_trace_token=20180122053048-58e2991f-fef2-11e7-b2dc-525400f775ce; PRE_UTM=; LGUID=20180122053048-58e29cd9-fef2-11e7-b2dc-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; X_HTTP_TOKEN=7e9c503b9a29e06e6d130f153c562827; _gat=1; LGSID=20180122055709-0762fae6-fef6-11e7-b2e0-525400f775ce; PRE_HOST=github.com; PRE_SITE=https%3A%2F%2Fgithub.com%2Fconghuaicai%2Fscrapy-spider-templetes; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2F4060662.html; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1516569758,1516570249,1516570359,1516571830; _putrc=88264D20130653A0; login=true; unick=%E7%94%B0%E5%B2%A9; gate_login_token=3426bce7c3aa91eec701c73101f84e2c7ca7b33483e39ba5; LGRID=20180122060053-8c9fb52e-fef6-11e7-a59f-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1516572053; TG-TRACK-CODE=index_navigation; SEARCH_ID=a39c9c98259643d085e917c740303cc7',
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        }
    }

    rules = (
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )

    def parse_job(self, response):
        # Parse a Lagou job posting page
        item_loader = LagouJobItemLoader(item=LagouspiderItem(), response=response)
        item_loader.add_css("title", ".job-name::attr(title)")
        item_loader.add_value("url", response.url)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_css("salary", ".job_request .salary::text")
        item_loader.add_xpath("job_city", "//*[@class='job_request']/p/span[2]/text()")
        item_loader.add_xpath("work_years", "//*[@class='job_request']/p/span[3]/text()")
        item_loader.add_xpath("degree_need", "//*[@class='job_request']/p/span[4]/text()")
        item_loader.add_xpath("job_type", "//*[@class='job_request']/p/span[5]/text()")
        item_loader.add_css("tags", '.position-label li::text')
        item_loader.add_css("publish_time", ".publish_time::text")
        item_loader.add_css("job_advantage", ".job-advantage p::text")
        item_loader.add_css("job_desc", ".job_bt div")
        item_loader.add_css("job_address", ".work_addr")
        item_loader.add_css("company_name", "#job_company dt a img::attr(alt)")
        item_loader.add_css("company_url", "#job_company dt a::attr(href)")
        item_loader.add_value("crawl_time", datetime.datetime.now())

        job_item = item_loader.load_item()
        return job_item
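The spider imports get_md5 from LaGouSpider.utils.common, a project utility module not shown in this article. A minimal sketch of such a helper (the original project's implementation may differ) is:
import hashlib

def get_md5(url):
    # Hash the URL to a fixed-length hex digest, used here as url_object_id
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()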
items.py
import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst, Join
from scrapy.loader import ItemLoader
from w3lib.html import remove_tags
from LaGouSpider.settings import SQL_DATETIME_FORMAT


class LagouJobItemLoader(ItemLoader):
    # Custom ItemLoader: take the first extracted value by default
    default_output_processor = TakeFirst()


def remove_splash(value):
    # Strip slashes from a value
    return value.replace("/", "")


def handle_jobaddr(value):
    # Clean up the job address: drop the "查看地图" ("view map") text and surrounding whitespace
    addr_list = value.split("\n")
    addr_list = [item.strip() for item in addr_list if item.strip() != "查看地图"]
    return "".join(addr_list)


class LagouspiderItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    salary = scrapy.Field()
    job_city = scrapy.Field(
        input_processor=MapCompose(remove_splash),
    )
    work_years = scrapy.Field(
        input_processor=MapCompose(remove_splash),
    )
    degree_need = scrapy.Field(
        input_processor=MapCompose(remove_splash),
    )
    job_type = scrapy.Field()
    publish_time = scrapy.Field()
    job_advantage = scrapy.Field()
    job_desc = scrapy.Field()
    job_address = scrapy.Field(
        input_processor=MapCompose(remove_tags, handle_jobaddr),
    )
    company_name = scrapy.Field()
    company_url = scrapy.Field()
    tags = scrapy.Field(
        input_processor=Join(",")
    )
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
            insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
            job_type, publish_time, job_advantage, job_desc, job_address, company_name, company_url,
            tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
        """
        params = (
            self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
            self["work_years"], self["degree_need"], self["job_type"],
            self["publish_time"], self["job_advantage"], self["job_desc"],
            self["job_address"], self["company_name"], self["company_url"],
            self["tags"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
        )
        return insert_sql, params
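get_insert_sql relies on SQL_DATETIME_FORMAT imported from the project's settings.py, which is not shown here. A plausible definition (an assumption, not taken from the original project) would be:
# In LaGouSpider/settings.py (assumed value, not shown in the original article)
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"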
pipeline.py
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class LagouspiderPipeline(object):
    def process_item(self, item, spider):
        return item


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            password=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use twisted to make the MySQL insert asynchronous
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert:
        # build the SQL statement from the item and write it to MySQL
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
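For MysqlTwistedPipeline to run, the project's settings.py must register it in ITEM_PIPELINES and define the MYSQL_* values it reads in from_settings. A sketch with placeholder values (the module path assumes the pipeline file is named pipelines.py; the database name and credentials are assumptions):
# In LaGouSpider/settings.py (placeholder values)
ITEM_PIPELINES = {
    'LaGouSpider.pipelines.MysqlTwistedPipeline': 300,
}
MYSQL_HOST = "localhost"
MYSQL_DBNAME = "lagou"        # assumed database name
MYSQL_USER = "root"
MYSQL_PASSWORD = "123456"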