A Simple Image Scraper: Collecting and Downloading Images with Python

二爷

Published 2020-07-22 14:34:50

Collected in the column: 二爷记

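This is a very simple image scraper. It collects links from a single page, then visits each linked page to fetch the full-size images and download them. The target is an overseas site, so access can be slow; using a proxy tool is recommended.

Target URL: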

https://thedieline.com/blog/2020/5/19/the-worlds-best-packaging-dieline-awards-2020-winners-revealed

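The site has essentially no anti-scraping measures, and its page structure is clear and simple, so the job comes down to locating the right nodes.

The links we want sit under two nodes.

Node one

XPath expression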

hrefs=req.xpath('//p[@class="data-import-preserve"]/a/@href')

Node two

XPath expression

hrefs=req.xpath('//b[@class="data-import-preserve"]/a/@href')
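As a minimal sketch, the two expressions can also be combined into a single query with the XPath union operator `|`. The markup and the `example-a` / `example-b` URLs below are hypothetical stand-ins for the real page, and this is one possible approach rather than what the article's source code does (it queries a single node).

```python
from lxml import etree

# Hypothetical markup standing in for the real page
SAMPLE_HTML = """
<div>
  <p class="data-import-preserve"><a href="https://thedieline.com/blog/example-a">A</a></p>
  <b class="data-import-preserve"><a href="https://thedieline.com/blog/example-b">B</a></b>
  <p class="other"><a href="https://example.com/ignored">C</a></p>
</div>
"""

# Union of the node-one and node-two expressions
BOTH_NODES = (
    '//p[@class="data-import-preserve"]/a/@href'
    ' | '
    '//b[@class="data-import-preserve"]/a/@href'
)

tree = etree.HTML(SAMPLE_HTML)
# Convert lxml's string results to plain str for printing and comparison
hrefs = [str(h) for h in tree.xpath(BOTH_NODES)]
print(hrefs)
```

The link under `<p class="other">` is not matched, because neither expression selects that class.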

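Together, these two nodes should yield all the links. Some invalid links still need to be filtered out, though; otherwise the program raises errors or produces bogus data.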
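One way to do that filtering is sketched below, under assumptions: `filter_links` is a hypothetical helper name, the host check and de-duplication are choices made for illustration, and the `example-entry` URL is made up. The `-riceman` URL is one of the links the article's source code excludes.

```python
from urllib.parse import urlparse


def filter_links(hrefs, skip=()):
    """Keep absolute thedieline.com links; drop duplicates and known-bad URLs."""
    skip = set(skip)
    seen = set()
    kept = []
    for href in hrefs:
        host = urlparse(href).netloc
        if host not in ("thedieline.com", "www.thedieline.com"):
            continue  # relative links and links to other sites
        if href in skip or href in seen:
            continue  # explicitly excluded, or already collected
        seen.add(href)
        kept.append(href)
    return kept


sample = [
    "https://thedieline.com/blog/example-entry",      # hypothetical entry URL
    "https://thedieline.com/blog/example-entry",      # duplicate
    "https://example.com/elsewhere",                  # other site
    "/blog/relative-link",                            # relative link
    "https://thedieline.com/blog/2020/5/6/-riceman",  # excluded in the source code
]
links = filter_links(sample, skip=["https://thedieline.com/blog/2020/5/6/-riceman"])
print(links)
```

Filtering against a set does not fail when an excluded URL is missing from the list, whereas `list.remove` raises `ValueError` in that case.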

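Handling image download timeouts

The image download includes basic timeout handling, written in the simplest way with try/except. It is offered for reference only.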
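The retry-on-timeout idea can be sketched as a small loop instead of nested try/except blocks. This is an illustration under assumptions, not the article's implementation: `fetch_with_retry` and `flaky_fetch` are hypothetical names, the download is simulated so the example runs without network access, and the built-in `TimeoutError` stands in for the library's timeout exception.

```python
import time


def fetch_with_retry(fetch, retries=2, delay=2.0, sleep=time.sleep):
    """Call fetch() and retry after a timeout, up to `retries` extra attempts.

    fetch: zero-argument callable that returns the downloaded bytes
    sleep: injectable so the example can run without really waiting
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except TimeoutError:
            if attempt >= retries:
                raise  # give up and let the caller log the failure
            attempt += 1
            sleep(delay)


# Demonstration with a fake download that times out once, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("simulated read timeout")
    return b"image-bytes"


data = fetch_with_retry(flaky_fetch, sleep=lambda seconds: None)
print(data, calls["count"])
```

With `requests`, the exception to catch would be `requests.exceptions.Timeout` rather than the built-in `TimeoutError`, and `fetch` would wrap the `requests.get(...)` call.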

Results

[Screenshots: collection output; downloaded images]

Source code:

# -*- coding: UTF-8 -*-
# thedieline scraper
# 2020-05-20, by WeChat: huguo00289
import os
import re
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


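def ua():
    # Build request headers with a random User-Agent string
    headers = {"User-Agent": UserAgent().random}
    return headers

def get_urllist():
    # Fetch the awards page and collect the links to the individual entries
    url="https://thedieline.com/blog/2020/5/19/the-worlds-best-packaging-dieline-awards-2020-winners-revealed"
    response=requests.get(url,headers=ua(),timeout=8).content.decode('utf-8')
    req=etree.HTML(response)
    # Only the node-two expression is used here; substitute the node-one
    # expression (//p[@class="data-import-preserve"]/a/@href) to collect the other links
    hrefs=req.xpath('//b[@class="data-import-preserve"]/a/@href')
    print(len(hrefs))
    return hrefs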

def get_imgs(url):
    # Fetch one entry page, create a folder named after its title, download its images
    response = requests.get(url, headers=ua(), timeout=8).content.decode('utf-8')
    time.sleep(1)
    req = etree.HTML(response)
    title=req.xpath('//title/text()')[0]
    title=re.sub(r'[\|\/\<\>\:\*\?\\\"]', "_", title)  # replace characters that are illegal in file names
    print(title)
    os.makedirs(f'{title}/',exist_ok=True)  # create the output directory
    imgs=req.xpath('//figure[@class="data-import-preserve"]/img/@src')
    print(len(imgs))
    # Name the images 1.jpeg, 2.jpeg, ... in page order
    for i, img_url in enumerate(imgs, start=1):
        img_name=f'{i}.jpeg'
        bctp(title, img_url, img_name)


# Download one image
def bctp(lj,img_url,img_name):
    # lj: target directory, img_url: image address, img_name: file name to save as
    print("Starting image download!")
    try:
        r = requests.get(img_url,headers=ua(),timeout=5)
        with open(f'{lj}/{img_name}', 'wb') as f:
            f.write(r.content)
            print(f'Downloaded {img_name} successfully!')
            time.sleep(1)
    except Exception as e:
        # On a read timeout, wait briefly and retry once
        if "port=443): Read timed out" in str(e):
            time.sleep(2)
            try:
                r = requests.get(img_url, headers=ua(),timeout=5)
                with open(f'{lj}/{img_name}', 'wb') as f:
                    f.write(r.content)
                    print(f'Downloaded {img_name} successfully!')
            except Exception as e:
                # The retry failed as well: report it and record it in spider.txt
                print(f'Failed to download {img_name}!')
                print(f'Error: {e}')
                with open(f'{lj}/spider.txt', 'a+', encoding='utf-8') as f:
                    f.write(f'Error: {e}---failed to download {img_url}\n')
        else:
            # Any other error: report it and record it in spider.txt
            print(f'Failed to download {img_name}!')
            print(f'Error: {e}')
            with open(f'{lj}/spider.txt', 'a+', encoding='utf-8') as f:
                f.write(f'Error: {e}---failed to download {img_url}\n')


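def run():
    hrefs=get_urllist()
    # Known invalid links; filtering avoids the ValueError that
    # list.remove raises when a link is missing from the list
    invalid={
        "https://thedieline.com/blog/2020/5/6/-riceman",
        "https://thedieline.com/blog/2020/3/6/srisangdao-rices-packaging-can-be-reused-as-tissue-box",
        "https://thedieline.com/blog/2020/2/1/-revitalising-kelloggs",
    }
    hrefs=[href for href in hrefs if href not in invalid]
    print(len(hrefs))
    for href in hrefs:
        if "https://thedieline.com" in href:
            print(f'>>>Scraping {href}, collecting...')
            try:
                get_imgs(href)
            except Exception as e:
                # Skip a page that fails instead of stopping the whole run
                print(f'>>>Skipping {href}: {e}')

    print('>>>Collection complete!')

if __name__=='__main__':
    run()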

Originally published on 2020-05-21 on the WeChat official account Python与SEO学习, and shared here through the Tencent Cloud self-media syndication program.
