
Crawling jokes with Scrapy and storing them in MongoDB

[Reference: the official Scrapy tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html)

Install Scrapy, create a project, and add a spider file (note that `scrapy startproject spider` creates an outer `spider/` directory with the package and its `spiders/` folder inside it):

```shell
pip3 install scrapy
scrapy startproject spider
cd spider/spider/spiders
vim joke_spider.py
```

```python
# joke_spider.py
import os

import scrapy
from pymongo import MongoClient
from pyquery import PyQuery as Q

# from urllib.parse import quote_plus

host = ''            # host left blank in the original
uri = "mongodb://$"  # connection-string placeholder as in the original
client = MongoClient(uri)
db = client.assets
collection = db.joke


class JokeSpider(scrapy.Spider):
    name = "joke"
    num = 0  # running joke counter (initial value was blank in the original; 0 assumed)
    start_urls = [
        'http://www.jokeji.cn/jokehtml/xy/2018111222052837.htm',
    ]

    def parse(self, response):
        for p in response.css('span#text110 p'):
            num = self.num = self.num + 1
            # each <p> holds one joke; drop the two-character numbering prefix
            joke = Q(p.extract()).text()[2:]
            length = len(joke)
            collection.insert_one({
                'num': num,
                'joke': joke,
                'length': length
            })
        next_page = response.css('div.zw_page1 a::attr(href)').extract_first()
        # the upper bound on self.num was truncated in the original source
        if next_page is not None and self.num:
            yield response.follow(next_page, callback=self.parse)
```
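The `text()[2:]` slice suggests each paragraph's text begins with a two-character prefix, presumably a list number followed by the Chinese enumeration comma (`1、`). A minimal sketch of the stripping, on a hypothetical extracted string:

```python
# Hypothetical text() output for one <p>: "N、joke body"
raw = "3、一个关于程序员的笑话"
joke = raw[2:]  # drop the two-character "3、" prefix
print(joke)     # 一个关于程序员的笑话
```

Note that a fixed two-character slice would also shave a character off the joke body once the list number reaches two digits (`10、…`).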

[Fixing Unicode-escaped characters in the exported output](https://stackoverflow.com/questions/36936403/scrapy-convert-from-unicode-to-utf-8)

```python
# settings.py
from scrapy.exporters import JsonLinesItemExporter


class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)


FEED_EXPORTERS = {
    'jsonlines': 'myproject.settings.MyJsonLinesItemExporter',
    'jl': 'myproject.settings.MyJsonLinesItemExporter',
}
```

(Replace `myproject` with your actual project package name, `spider` in this walkthrough.)
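The `ensure_ascii=False` passed to the exporter is the same switch `json.dumps` takes; without it, every non-ASCII character in the feed is written as a `\uXXXX` escape:

```python
import json

item = {"joke": "程序员笑话"}
print(json.dumps(item))                      # non-ASCII escaped as \uXXXX sequences
print(json.dumps(item, ensure_ascii=False))  # readable UTF-8 output
```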

Install IPython so that `scrapy shell` uses it for interactively testing selectors against a scraped page:

```shell
pip3 install ipython
vim scrapy.cfg
```

```
[settings]
shell = ipython
```

Enable remote connections for MongoDB:

```shell
vim /etc/mongod.conf
```

```
bind_ip = 0.0.0.0
```

(In the newer YAML config format, set `net.bindIp: 0.0.0.0` instead. Binding to all interfaces exposes the database, so restrict access accordingly.)

[MongoDB Python driver (PyMongo) official documentation](http://api.mongodb.com/python/current/tutorial.html)
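The commented-out `quote_plus` import in the spider points at the PyMongo-documented way to embed credentials containing special characters in a connection string; a standard-library sketch (user, password, and host here are hypothetical):

```python
from urllib.parse import quote_plus

user = "joke_bot"        # hypothetical credentials
password = "p@ss:word/1"
host = "127.0.0.1"

# Percent-encode the credentials so '@', ':' and '/' don't break URI parsing
uri = "mongodb://%s:%s@%s" % (quote_plus(user), quote_plus(password), host)
print(uri)  # mongodb://joke_bot:p%40ss%3Aword%2F1@127.0.0.1
```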

Finally, run the spider:

```shell
scrapy crawl joke
```

[Source code](https://github.com/iamtmoe/scrapy.git)

(It's so annoying that these posts never reach the 300-word minimum.)


  • Originally published at https://kuaibao.qq.com/s/20181114G124GD00?refer=cp_1026
  • Republished from the Tencent Content Open Platform (企鹅号) via the Tencent Cloud Developer Community.
