I'm trying to make a simple program that grabs all the image URLs on a webpage and downloads them into a folder. The problem is that I'm getting a 403 error. I've been trying to fix it for over an hour and desperately need help. Here is my code:
import urllib.request
import requests
from bs4 import BeautifulSoup

url = 'https://www.webtoons.com/en/slice-of-life/how-to-love/ep-100-happy-ending-last-episode/viewer?title_no=472&episode_no=100'
data = requests.get(url)
code = BeautifulSoup(data.text, 'html.parser')
photos = []

def dl_jpg(url, filePath, fileName):
    fullPath = filePath + fileName + '.jpg'
    urllib.request.urlretrieve(url, fullPath)

for img in code.find('div', id='_imageList'):
    pic = str(img)[43:147]
    photos.append(str(pic))

for photo in photos:
    if photo == '':
        photos.remove(photo)

for photo in photos[0:-4]:
    dl_jpg(photo, 'images/', 'img')
Websites usually block requests that come in without a user agent. I've updated your code to send a user agent along with each request. I also chose to use only the requests library and drop urllib; urllib does support setting headers, but you're already using requests and I'm more familiar with it.
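For reference, if you'd rather keep urllib, here's a minimal sketch of sending the same header with it alone (it reuses the same url and user_agent variables defined in the code below):

import urllib.request

# a Request object lets you attach headers, unlike a bare urlretrieve call
req = urllib.request.Request(url, headers={'User-Agent': user_agent})
with urllib.request.urlopen(req) as response:
    html = response.read()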
I'd also suggest adding a delay/sleep between requests; 30-45 seconds is a good amount. That avoids spamming the site and effectively creating a denial of service, and some sites will block your requests anyway if you send too many too quickly. There's a short sketch of this after the code below.
import requests
from bs4 import BeautifulSoup

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"

url = 'https://www.webtoons.com/en/slice-of-life/how-to-love/ep-100-happy-ending-last-episode/viewer?title_no=472&episode_no=100'
data = requests.get(url, headers={'User-Agent': user_agent})
code = BeautifulSoup(data.text, 'html.parser')
photos = []

def dl_jpg(url, filePath, fileName):
    fullPath = filePath + fileName + '.jpg'
    # make the request with the user agent; if it succeeds, save the result
    image_request = requests.get(url, headers={'User-Agent': user_agent})
    if image_request.status_code == 200:
        with open(fullPath, 'wb') as f:
            f.write(image_request.content)

for img in code.find('div', id='_imageList'):
    # fixed-offset slice that cuts the image URL out of the tag's string form (fragile)
    pic = str(img)[43:147]
    photos.append(str(pic))

# drop the empty strings produced by non-image children of the div
# (removing items from a list while iterating over it can skip entries)
photos = [photo for photo in photos if photo != '']

# number the files so each download doesn't overwrite the previous one
for i, photo in enumerate(photos[0:-4]):
    dl_jpg(photo, 'images/', 'img' + str(i))
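To add the suggested delay, a minimal sketch of the final loop with a pause between downloads:

import time

for i, photo in enumerate(photos[0:-4]):
    dl_jpg(photo, 'images/', 'img' + str(i))
    time.sleep(30)  # wait 30 seconds between image requests to avoid being blocked

One more note: the code assumes an images/ folder already exists next to the script; you can create it up front with os.makedirs('images', exist_ok=True).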