looter: a Python crawler framework

The well-known pyspider and scrapy need no introduction; today let's talk about looter.


First install Python 3 (version 3.6 or above), then run pip install looter.

λ looter -h
Looter, a python package designed for web crawler lovers :)
Author: alphardex  QQ:2582347430
If any suggestion, please contact me. Thank you for cooperation!

Usage:
  looter genspider <name> [--async]
  looter shell [<url>]
  looter (-h | --help | --version)

Options:
  -h --help        Show this screen.
  --version        Show version.
  --async          Use async instead of concurrent.


λ looter shell

Available objects:
    url           The url of the site you crawled.
    res           The response of the site.
    tree          The element source tree to be parsed.

Available functions:
    fetch         Send HTTP request to the site and parse it as a tree. [has async version]
    view          View the page in your browser. (test rendering)
    links         Get the links of the page.
    save          Save what you crawled as a file. (json or csv)

Examples:
    Get all the <li> elements of a <ul> table:
        >>> items = tree.css('ul li')
    Get the links with a regex pattern:
        >>> items = links(res, pattern=r'.*/(jpeg|image)/.*')

For more info, plz refer to documentation:
    [looter]: 

>>> imgs = tree.css('a.directlink::attr(href)').extract()
>>> imgs[1:10]
['', '', '', '', '', '', '', '', '']
>>> Path('konachan.txt').write_text('\n'.join(imgs))

wget -i konachan.txt
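The `links(res, pattern=...)` helper shown above filters a page's anchors with a regular expression. The same filtering step can be sketched in plain Python (a hypothetical `filter_links` function over a sample `hrefs` list, not looter's actual implementation):

```python
import re

def filter_links(hrefs, pattern=None):
    """Keep only the URLs that match the given regex pattern."""
    if pattern is None:
        return list(hrefs)
    rx = re.compile(pattern)
    return [h for h in hrefs if rx.match(h)]

# Sample links, mimicking the konachan session above
hrefs = [
    'https://konachan.com/image/abc.png',
    'https://konachan.com/post/12345',
    'https://konachan.com/jpeg/def.jpg',
]
matched = filter_links(hrefs, pattern=r'.*/(jpeg|image)/.*')
```

With the pattern from the shell example, only the `/image/` and `/jpeg/` links survive; the `/post/` link is dropped.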

Crawling V2EX

import time
import looter as lt
from pprint import pprint
from concurrent import futures

domain = ''
total = []

def crawl(url):
    tree = lt.fetch(url)
    items = tree.css('#TopicsNode .cell')
    for item in items:
        data = {}
        data['title'] = item.css('span.item_title a::text').extract_first()
        data['author'] = item.css('span.small.fade strong a::text').extract_first()
        data['source'] = f"{domain}{item.css('span.item_title a::attr(href)').extract_first()}"
        reply = item.css('a.count_livid::text').extract_first()
        data['reply'] = int(reply) if reply else 0
        pprint(data)
        total.append(data)
    time.sleep(1)

if __name__ == '__main__':
    tasklist = [f'{domain}/go/python?p={n}' for n in range(1, 10)]
    [crawl(task) for task in tasklist]
    lt.save(total, name='v2ex.csv', sort_by='reply', order='desc')
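The final `save` call sorts the collected rows by reply count and writes them to a CSV file. A rough plain-Python sketch of that step using the standard csv module (a hypothetical `save_csv` helper, not looter's actual implementation):

```python
import csv

def save_csv(rows, name, sort_by=None, order='asc'):
    """Write a list of dicts to a CSV file, optionally sorted by one field."""
    if sort_by:
        rows = sorted(rows, key=lambda r: r[sort_by], reverse=(order == 'desc'))
    with open(name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Toy data standing in for the crawled V2EX topics
data = [{'title': 'a', 'reply': 3}, {'title': 'b', 'reply': 10}]
top = save_csv(data, 'v2ex.csv', sort_by='reply', order='desc')
```

With `order='desc'`, the topic with the most replies ends up in the first data row, which matches the output table below.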


,author,reply,source,title
0,chinesehuazhou,127,,10 行 Python 代码,批量压缩图片 500 张,简直太强大了(内有公号宣传,不喜勿进)
1,chinesehuazhou,103,,len(x) 击败 x.len(),从内置函数看 Python 的设计思想(内有公号宣传,不喜勿进)
2,nfroot,73,,面对 Python 的强大和难用性表示深深的迷茫,莫非打开方式不对?
3,css3,58,,你们用什么工具来管理 Python 的库啊?
4,Northxw,54,,花式反爬之某众点评网
5,akmonde,48,,Python 项目移植到其他机器,要求全 Linux 系统适配
6,kayseen,47,,这道 Python 题目有大神会做吗?
7,hellomacos,41,,老生常谈的问题:如何学好 Python

Originally published on the WeChat official account 苏生不惑 (susheng_buhuo)




