Python Zhihu column scraper: turning column articles into a PDF ebook with pdfkit

By 二爷 · Published 2020-07-22 14:38:59 · From the column 二爷记

In this post I write a scraper and use pdfkit to turn a column's articles into a PDF ebook to read at leisure.

Column URL:

https://zhuanlan.zhihu.com/xdbcb8

Table-of-contents page:

https://zhuanlan.zhihu.com/p/48373518

The table-of-contents page is the entry point for the crawl.

First attempt: spoof the User-Agent header with the fake_useragent library. Even so, two or three requests out of ten still got blocked. No surprise, Zhihu's header checks are fairly thorough.

Result of ten runs:

Reference code:

# https://zhuanlan.zhihu.com/p/48373518
# 2020-06-15  by WeChat: huguo00289

# -*- coding: UTF-8 -*-
import requests,time
from fake_useragent import UserAgent
from lxml import etree


def get_urllist():
    ua=UserAgent()
    headers={
        'user-agent':ua.random,
    }
    url="https://zhuanlan.zhihu.com/p/48373518"
    response=requests.get(url,headers=headers,timeout=5)
    print(response.status_code)
    time.sleep(2)
    html=response.content.decode('utf-8')
    req=etree.HTML(html)
    hrefs=req.xpath('//div[@class="RichText ztext Post-RichText"]/ul//a/@href')
    print(hrefs)



if __name__=='__main__':
    for i in range(1,11):
        get_urllist()

No way around it: just use your own browser's UA string instead, otherwise the script keeps erroring out.
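A small helper makes that fallback explicit: try a random UA from fake_useragent, and drop back to a fixed browser UA string when the library is unavailable or keeps failing (a sketch; `build_headers` is an illustrative name, and the fallback string is the one used later in this post):

```python
FALLBACK_UA = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36')

def build_headers(use_random=True):
    """Build request headers, falling back to a fixed browser UA."""
    if use_random:
        try:
            from fake_useragent import UserAgent
            return {'user-agent': UserAgent().random}
        except Exception:
            pass  # fake_useragent missing or its UA data could not be loaded
    return {'user-agent': FALLBACK_UA}
```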

For now the cookie header doesn't seem to matter much: crawling the whole column once produced no errors, and the page structure is very regular. Then again, the scraping here doesn't go very deep.
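If the site does start demanding cookies at some point, a `requests.Session` carries them across requests automatically (a short sketch using the standard requests API):

```python
import requests

session = requests.Session()
# Headers set on the session apply to every request made through it;
# cookies returned by the server are stored and resent automatically.
session.headers.update({'user-agent': 'Mozilla/5.0'})
# pages would then be fetched with session.get(url) instead of requests.get(url)
```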

Key points:

1. Serializing an etree node back to HTML

h1=etree.tostring(h1,encoding='utf-8').decode('utf-8')

You'll need to debug here to confirm the output is valid HTML.

2. Using pdfkit

Like selenium, pdfkit needs installation and configuration: install the Python wrapper with pip install pdfkit, and install the external wkhtmltopdf binary separately.

First, define the path to the wkhtmltopdf executable:

confg = pdfkit.configuration(wkhtmltopdf=r'C:\Users\Administrator\AppData\Local\Programs\Python\Python37\wkhtmltox\bin\wkhtmltopdf.exe')
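That path is specific to one machine. A small helper can look wkhtmltopdf up on PATH first and only fall back to an explicit location (a sketch; `find_wkhtmltopdf` and the fallback path are illustrative):

```python
import shutil

def find_wkhtmltopdf(fallback=r'C:\wkhtmltox\bin\wkhtmltopdf.exe'):
    # shutil.which returns the executable's full path if it is on PATH,
    # otherwise None, in which case we use the explicit install location.
    return shutil.which('wkhtmltopdf') or fallback

# confg = pdfkit.configuration(wkhtmltopdf=find_wkhtmltopdf())
```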

Then set the rendering options:

options = {
    'page-size': 'A4',
    'margin-top': '0.75in',
    'margin-right': '0.75in',
    'margin-bottom': '0.75in',
    'margin-left': '0.75in',
    'encoding': "UTF-8",
    'outline': None,
}
pdfkit.from_string(datas, r'out.pdf',options=options,configuration=confg)

Run output:

PDF ebook result:

Full reference code:

# https://zhuanlan.zhihu.com/p/48373518
# 2020-06-15  by WeChat: huguo00289

# -*- coding: UTF-8 -*-
import requests,time
from fake_useragent import UserAgent
from lxml import etree
import pdfkit

confg = pdfkit.configuration(wkhtmltopdf=r'C:\Users\Administrator\AppData\Local\Programs\Python\Python37\wkhtmltox\bin\wkhtmltopdf.exe')

def get_urllist():
    ua=UserAgent()
    headers={
        'user-agent':ua.random,
        #'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        #'cookie': 'SESSIONID=tPiQ2FpwANFw2tVR91VkmgEd5OHTxWSzjXpOB861nqq; JOID=UV8SBE3hul2soItWA-ULhKLMOToWiYsX3ufdBG-i8xTCkvkjdPW3u_ahj1YCSli_gQ4Fs3mJdFAfWYxxC2OzOdM=; osd=U14cBk7ju1Ouo4lXDecIhqPCOzkUiIUV3eXcCm2h8RXMkPohdfu1uPSggVQBSFmxgw0HsneLd1IeV45yCWK9O9A=; _zap=d55f3e6d-080a-4581-b3d3-c2a3698688aa; d_c0="AMCqkldLeg-PTtyw-z_gAIP5PcjeFBCdsJo=|1558685069"; __gads=ID=528a3696428a9d32:T=1558685297:S=ALNI_MZ9VGoTrHsNgTUcKt7Pw-nt0MfRZA; _xsrf=DJsB0m4gygVwX2u42LFqERf0llZT1t6X; tst=r; _ga=GA1.2.1869194376.1583723562; z_c0=Mi4xX09zZUdRQUFBQUFBd0txU1YwdDZEeGNBQUFCaEFsVk5Hc21DWHdDTjlBUXhJOHZEUmhLRTdyMUYxcnVLblc5Xzd3|1586854682|2b1575f2d3331cb3eb1327b1e6e0afd8fa7fe5fd; __utma=51854390.1869194376.1583723562.1586158626.1588234855.7; __utmz=51854390.1588234855.7.7.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utmv=51854390.100-1|2=registration_date=20200213=1^3=entry_date=20190524=1; q_c1=116807e43ab24320baa102068d5541f3|1591838671000|1558685083000; _gid=GA1.2.1281198261.1592183078; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1592189701,1592193127,1592199976,1592200401; KLBRSID=975d56862ba86eb589d21e89c8d1e74e|1592200462|1592200461; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1592200463',
    }
    url="https://zhuanlan.zhihu.com/p/48373518"
    response=requests.get(url,headers=headers,timeout=5)
    print(response.status_code)
    time.sleep(2)
    html=response.content.decode('utf-8')
    req=etree.HTML(html)
    hrefs=req.xpath('//div[@class="RichText ztext Post-RichText"]/ul//a/@href')
    print(hrefs)
    return hrefs


def get_content(url):
    headers = {
         'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    }
    response = requests.get(url, headers=headers, timeout=5)
    print(response.status_code)
    time.sleep(2)
    html = response.content.decode('utf-8')
    req=etree.HTML(html)
    h1=req.xpath('//h1[@class="Post-Title"]')[0]
    h1=etree.tostring(h1,encoding='utf-8').decode('utf-8')
    #print(h1)
    article=req.xpath('//div[@class="RichText ztext Post-RichText"]')[0]
    article = etree.tostring(article, encoding='utf-8').decode('utf-8')
    #print(article)
    content='%s%s'%(h1,article)
    print(content)

    return content


def dypdf(datas):
    #datas = f'<html><head><meta charset="UTF-8"></head><body>{datas}</body></html>'
    datas=f'''
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
    </head>
    <body>
    {datas}
    </body>
    </html>
    '''
    print("Starting PDF rendering!")
    options = {
        'page-size': 'A4',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'outline': None,
    }
    pdfkit.from_string(datas, r'out.pdf',options=options,configuration=confg)
    print("PDF saved successfully!")


def main():
    datas=''
    urls=get_urllist()
    for url in urls:
        content=get_content(url)
        datas='%s%s'%(datas,content)

    dypdf(datas)


if __name__=='__main__':
    main()

Unfortunately, the column's images are all animated GIFs, and none of them render in the printed PDF!

No fix found so far.
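One thing worth trying (untested here): Zhihu lazy-loads images, so the real URL often sits in a data-actualsrc attribute while src points at a placeholder that wkhtmltopdf can't fetch. Promoting that attribute to src before rendering might recover at least the first frame of each GIF. The attribute name is an assumption based on common Zhihu markup:

```python
import re

def fix_lazy_images(html):
    # Drop the placeholder src, then promote data-actualsrc to src.
    html = re.sub(r'<img([^>]*?)\ssrc="[^"]*"', r'<img\1', html)
    return html.replace('data-actualsrc="', 'src="')
```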

This article was shared from the WeChat public account Python与SEO学习 under the Tencent Cloud self-media sharing program. Originally published 2020-06-15.
