Qiushibaike multiprocessing demo (3)

zhengzongwei
Published 2019-07-31 14:55:22
Included in the column: Python | Blog

Copyright notice: Copyright © https://cloud.tencent.com/developer/article/1477128

import requests
from lxml import etree
from multiprocessing import Process
from multiprocessing import JoinableQueue as Queue


class QiubaiSpider:

    def __init__(self):
        self.temp_url = 'https://www.qiushibaike.com/8hr/page/{}/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36 QQBrowser/4.4.108.400'
        }
        self.url_q = Queue()
        self.html_q = Queue()
        self.content_q = Queue()

    def get_url(self):
        # Queue the URLs of pages 1 through 13.
        for i in range(1, 14):
            self.url_q.put(self.temp_url.format(i))

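    def parse_url(self):
        # Worker: fetch each queued URL and pass the decoded HTML on.
        while True:
            url = self.url_q.get()
            try:
                response = requests.get(url, headers=self.headers, timeout=10)
                self.html_q.put(response.content.decode())
            except requests.RequestException as exc:
                print('request failed: {} ({})'.format(url, exc))
            finally:
                # Always mark the task done so url_q.join() cannot hang.
                self.url_q.task_done()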

    def get_html(self):
        # Worker: parse each HTML page and extract the author and text of every post.
        while True:
            html_str = self.html_q.get()
            html = etree.HTML(html_str)
            div_list = html.xpath('//div[@id="content-left"]/div')
            content_list = []
            for div in div_list:
                # Both XPath queries return lists of raw text nodes.
                item = {
                    'author': div.xpath('.//h2/text()'),
                    'text': div.xpath('.//div[@class="content"]/span/text()'),
                }
                content_list.append(item)
            self.content_q.put(content_list)
            self.html_q.task_done()

    def save_html(self):

        current = 0
        while True:
            content_list = self.content_q.get()
            for content in content_list:
                current += 1
                print(content)
                print(current)
            self.content_q.task_done()

    def run(self):
        self.get_url()
        process_list = []
        for i in range(3):
            p_parse = Process(target=self.parse_url)
            process_list.append(p_parse)

        p_html = Process(target=self.get_html)
        process_list.append(p_html)
        p_save = Process(target=self.save_html)
        process_list.append(p_save)

        for i in process_list:
            i.daemon = True
            i.start()

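        # Block until every queue has been fully processed; the daemon
        # workers are then killed when the main process exits.
        for q in [self.url_q, self.html_q, self.content_q]:
            q.join()
        print('Main process finished')


if __name__ == '__main__':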
    qiubai = QiubaiSpider()
    qiubai.run()
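
The script above relies on one pattern throughout: each worker takes an item from a JoinableQueue, hands its result to the next queue, and only then calls task_done(), while the main process waits on join() and lets the daemon workers die with it. A minimal sketch of that pattern without any network access might look like the following. The names process_one, stage_worker, printer_worker, run_pipeline and square are hypothetical and exist only for this illustration; they are not part of the original script.

```python
from multiprocessing import JoinableQueue, Process


def square(n):
    """Stand-in for real work such as fetching or parsing a page."""
    return n * n


def process_one(in_q, out_q, func):
    """Take one item from in_q, put func(item) on out_q, then mark it done."""
    item = in_q.get()
    try:
        out_q.put(func(item))
    finally:
        # Runs even if func raises, so in_q.join() never waits on this item.
        in_q.task_done()


def stage_worker(in_q, out_q, func):
    """Endless loop for a middle stage; it stops when the main process exits."""
    while True:
        try:
            process_one(in_q, out_q, func)
        except Exception as exc:
            print('skipped one item:', exc, flush=True)


def printer_worker(in_q):
    """Last stage: print each result, much like save_html in the script."""
    while True:
        result = in_q.get()
        try:
            print('result:', result, flush=True)
        finally:
            in_q.task_done()


def run_pipeline(numbers):
    num_q = JoinableQueue()
    result_q = JoinableQueue()
    for n in numbers:
        num_q.put(n)

    workers = [Process(target=stage_worker, args=(num_q, result_q, square))
               for _ in range(2)]
    workers.append(Process(target=printer_worker, args=(result_q,)))
    for w in workers:
        w.daemon = True  # terminated automatically when the main process exits
        w.start()

    # Join in pipeline order: a stage puts its output on the next queue
    # before calling task_done(), so each join() sees all pending work.
    for q in (num_q, result_q):
        q.join()
    print('all queues drained')


if __name__ == '__main__':
    run_pipeline(range(5))
```

Because two workers share the first queue, the order of the printed results is not guaranteed. The try/finally around task_done() is the important part: if a worker dies before calling it, the matching join() in the main process blocks forever.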
Originally published on 2018-07-17 on the author's personal site/blog.
