Basic usage of the Scrapy framework

Scrapy commands come in two kinds: global commands and project commands. A global command can be run anywhere; a project command only works inside a Scrapy project.

C:\Users\AOBO>scrapy -h
Scrapy 1.2.1 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench        Run a quick benchmark to test local hardware performance: scrapy bench
fetch        Fetch a URL using the Scrapy downloader
genspider    Generate a new spider using pre-defined templates
runspider    Run a standalone spider file: scrapy runspider abc.py
settings     Get setting values
shell        Start an interactive shell, handy for debugging spiders: scrapy shell http://www.baidu.com --nolog (--nolog suppresses log output)
startproject Create a new Scrapy project, e.g. scrapy startproject demo (demo is the project name)
version      Print the Scrapy version: scrapy version
view         Open a URL in the default browser, as Scrapy sees it: scrapy view http://www.aobossir.com/
[ more ]     More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
D:\BaiduYunDownload\first>scrapy -h
Scrapy 1.2.1 - project: first
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl         Run a spider in the project: scrapy crawl f1, or scrapy crawl f1 --nolog
edit          Open a spider file in the editor (can be unreliable on Windows; works fine on Linux): scrapy edit f1
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list          List all spiders in the current project: scrapy list
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings      Get setting values
shell         Start an interactive shell, handy for debugging spiders
startproject  Create a new Scrapy project, e.g. scrapy startproject demo (demo is the project name)
version       Print the Scrapy version: scrapy version
view          Open a URL in the default browser, as Scrapy sees it
Use "scrapy <command> -h" to see more info about a command
Note: if Scrapy fails on Windows with ImportError: No module named win32api, install the missing dependency: pip install pypiwin32
scrapy -h
scrapy --help
(venv)ql@ql:~$ scrapy version
Scrapy 1.1.2
(venv)ql@ql:~$
(venv)ql@ql:~$ scrapy version -v
Scrapy : 1.1.2
lxml : 3.6.4.0
libxml2 : 2.9.4
Twisted : 16.4.0
Python : 2.7.12 (default, Jul 1 2016, 15:12:24) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.1.0 (OpenSSL 1.0.2g-fips 1 Mar 2016)
Platform : Linux-4.4.0-36-generic-x86_64-with-Ubuntu-16.04-xenial
(venv)ql@ql:~$
scrapy startproject project_name
scrapy genspider name domain
# e.g.:
#scrapy genspider sohu sohu.org
scrapy list
scrapy view http://www.baidu.com
# Open an interactive shell on the given URL
scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
response.xpath() # pass an XPath expression directly
scrapy runspider spider_file.py # runspider takes a spider file, not a spider name
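response.xpath() returns the nodes matching an XPath expression. As a stdlib-only sketch of the idea (no Scrapy needed), xml.etree.ElementTree supports a limited XPath subset on well-formed XML; the snippet and tag names below are made up for illustration:

```python
# XPath extraction sketch using only the standard library.
# Scrapy's response.xpath() is far more capable (full XPath 1.0 on
# real-world HTML); ElementTree handles only a subset, on valid XML.
import xml.etree.ElementTree as ET

html = """<div>
  <h3><a>House One</a></h3>
  <h3><a>House Two</a></h3>
</div>"""

root = ET.fromstring(html)
# ".//h3/a" selects every <a> directly under an <h3>, at any depth
titles = [a.text for a in root.findall(".//h3/a")]
print(titles)  # ['House One', 'House Two']
```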
scrapy startproject demo
demo
├── demo
│ ├── __init__.py
│ ├── __pycache__
│ ├── items.py # defines the Items: the structure of the scraped data
│ ├── middlewares.py # Spider and Downloader middleware implementations
│ ├── pipelines.py # Item Pipeline implementations: the data-processing pipeline
│ ├── settings.py # global configuration for the project
│ └── spiders # the Spider implementations, one file per Spider
│ ├── __init__.py
│ └── __pycache__
└── scrapy.cfg # deployment configuration: config file location, deployment settings, etc.
scrapy genspider fang fang.5i5j.com
$ tree
├── demo
│ ├── __init__.py
│ ├── __pycache__
│ │ ├── __init__.cpython-36.pyc
│ │ └── settings.cpython-36.pyc
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ ├── __pycache__
│ │ └── __init__.cpython-36.pyc
│ └── fang.py # genspider created this spider file under spiders/
└── scrapy.cfg
# fang.py, as generated:
# -*- coding: utf-8 -*-
import scrapy


class FangSpider(scrapy.Spider):
    name = 'fang'
    allowed_domains = ['fang.5i5j.com']
    start_urls = ['http://fang.5i5j.com/']

    def parse(self, response):
        pass
# items.py: define the fields to scrape
import scrapy


class FangItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    address = scrapy.Field()
    time = scrapy.Field()
    clicks = scrapy.Field()
    price = scrapy.Field()
# -*- coding: utf-8 -*-
import scrapy
from demo.items import FangItem


class FangSpider(scrapy.Spider):
    name = 'fang'
    allowed_domains = ['fang.5i5j.com']
    #start_urls = ['http://fang.5i5j.com/']
    start_urls = ['https://fang.5i5j.com/bj/loupan/']

    def parse(self, response):
        hlist = response.css("div.houseList_list")
        for vo in hlist:
            item = FangItem()
            item['title'] = vo.css("h3.fontS20 a::text").extract_first()
            item['address'] = vo.css("span.addressName::text").extract_first()
            item['time'] = vo.re("<span>(.*?)开盘</span>")[0]
            item['clicks'] = vo.re("<span><i>([0-9]+)</i>浏览</span>")[0]
            item['price'] = vo.css("i.fontS24::text").extract_first()
            #print(item)
            yield item
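Selector.re() applies a regular expression to the selected HTML and returns the captured groups. The same two patterns from the spider can be tried with the stdlib re module; the HTML snippet here is made up for illustration:

```python
# How the Selector.re() patterns above behave, shown with stdlib re.
# The HTML snippet is invented; real pages will differ.
import re

html = "<span>2019年05月开盘</span><span><i>1024</i>浏览</span>"

# non-greedy (.*?) captures the text before 开盘 ("opens on")
time = re.findall(r"<span>(.*?)开盘</span>", html)[0]
# ([0-9]+) captures the view count before 浏览 ("views")
clicks = re.findall(r"<span><i>([0-9]+)</i>浏览</span>", html)[0]
print(time, clicks)  # 2019年05月 1024
```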
# pipelines.py: print and pass through each item
class DemoPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
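A pipeline can also persist items instead of just printing them. Below is a minimal sketch of a pipeline that appends each item to a JSON-lines file; the class name and output filename are my own choices, not part of the original project. Scrapy calls open_spider/process_item/close_spider automatically once the class is registered in ITEM_PIPELINES in settings.py.

```python
# Sketch: a pipeline that writes each item as one JSON line.
# Class name and output file are assumptions for illustration.
import json


class JsonLinesPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('fangs.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) also works for scrapy.Item instances
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()
```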
scrapy crawl fang
scrapy crawl fang -o fangs.json
scrapy crawl fang -o fangs.csv
scrapy crawl fang -o fangs.xml
scrapy crawl fang -o fangs.pickle
scrapy crawl fang -o fangs.marshal
POST requests: scrapy.FormRequest submits a POST request and can carry form parameters. As an example, use the Youdao translation endpoint http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule
Generate the Youdao spider: scrapy genspider youdao fanyi.youdao.com
# -*- coding: utf-8 -*-
import json

import scrapy


class YoudaoSpider(scrapy.Spider):
    name = 'youdao'
    allowed_domains = ['fanyi.youdao.com']
    #start_urls = ['http://fanyi.youdao.com']

    def start_requests(self):
        url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
        keyword = input("Enter a word to translate: ")
        data = {'i': keyword, 'doctype': 'json'}
        # FormRequest is how Scrapy sends a POST request
        yield scrapy.FormRequest(
            url=url,
            formdata=data,
            callback=self.parse,
        )

    def parse(self, response):
        res = json.loads(response.body)
        print(res['translateResult'][0][0]['tgt'])
        input("Press Enter to continue")
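Under the hood, FormRequest URL-encodes the formdata dict into an application/x-www-form-urlencoded POST body. A stdlib sketch of that encoding step, so you can see what actually gets sent:

```python
# What FormRequest does with formdata: urlencode it into the POST body.
from urllib.parse import urlencode

data = {'i': 'hello', 'doctype': 'json'}
body = urlencode(data)
print(body)  # i=hello&doctype=json
```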