[See the official Scrapy tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html)
pip3 install scrapy
scrapy startproject spider
cd spider/spider/spiders
vim joke_spider.py
```python
# joke_spider.py
import scrapy
from pyquery import PyQuery as Q
from pymongo import MongoClient

host = ''  # fill in your MongoDB host, e.g. '127.0.0.1:27017'
uri = "mongodb://" + host
client = MongoClient(uri)
db = client.assets
collection = db.joke


class JokeSpider(scrapy.Spider):
    name = "joke"
    num = 0          # running count of scraped jokes
    max_num = 10000  # crawl cap (hypothetical value; the original limit was truncated)
    start_urls = [
        'http://www.jokeji.cn/jokehtml/xy/2018111222052837.htm',
    ]

    def parse(self, response):
        for p in response.css('span#text110 p'):
            num = self.num = self.num + 1
            # drop the leading "N." index before the joke text
            joke = Q(p.extract()).text()[2:]
            length = len(joke)
            collection.insert_one({
                'num': num,
                'joke': joke,
                'length': length,
            })
        next_page = response.css('div.zw_page1 a::attr(href)').extract_first()
        if next_page is not None and self.num < self.max_num:
            yield response.follow(next_page, callback=self.parse)
```
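The `[2:]` slice assumes every joke is prefixed by exactly one digit plus a dot. A slightly more robust sketch (using a hypothetical `strip_index` helper, not part of the original code) handles multi-digit indexes like `12.` with a regex:

```python
import re

def strip_index(text):
    # Drop a leading "N." joke index such as "1." or "12.";
    # a fixed [2:] slice would leave part of a two-digit index behind.
    return re.sub(r'^\d+\.', '', text)

print(strip_index('1.一个笑话'))    # → 一个笑话
print(strip_index('12.另一个笑话'))  # → 另一个笑话
```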
[Fixing Unicode-escaped output](https://stackoverflow.com/questions/36936403/scrapy-convert-from-unicode-to-utf-8)
```python
# settings.py
from scrapy.exporters import JsonLinesItemExporter

class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

FEED_EXPORTERS = {
    # module path assumes the project created above, named "spider"
    'jsonlines': 'spider.settings.MyJsonLinesItemExporter',
    'jl': 'spider.settings.MyJsonLinesItemExporter',
}
```
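The effect of `ensure_ascii=False` can be seen with the standard `json` module, which Scrapy's JSON exporters build on:

```python
import json

item = {'joke': '一个笑话'}

# Default: non-ASCII characters are escaped to \uXXXX sequences
print(json.dumps(item))                      # {"joke": "\u4e00\u4e2a\u7b11\u8bdd"}

# With ensure_ascii=False the UTF-8 text is written as-is
print(json.dumps(item, ensure_ascii=False))  # {"joke": "一个笑话"}
```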
Install IPython to use it as the Scrapy shell for testing selectors:
pip3 install ipython
vim scrapy.cfg
```ini
[settings]
shell = ipython
```
Enable remote connections for MongoDB:
vim /etc/mongod.conf
bind_ip = 0.0.0.0
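Note that recent MongoDB packages ship `/etc/mongod.conf` in YAML format, where the same setting lives under the `net` section:

```yaml
# /etc/mongod.conf (YAML format, MongoDB 2.6+)
net:
  port: 27017
  bindIp: 0.0.0.0   # listen on all interfaces; restrict via firewall or auth in production
```

After editing, restart the service (`systemctl restart mongod`).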
[MongoDB Python driver official tutorial](http://api.mongodb.com/python/current/tutorial.html)
Finally, run the spider:
scrapy crawl joke
[Source code](https://github.com/iamtmoe/scrapy.git)