
AI Web Crawler: Batch Downloading All Links from a Web Page

AIGC部落 · Published 2024-07-10 · Column: Dance with GenAI

The web page to crawl looks like this and contains multiple links:

Inspecting the page source, each link is an a tag like this one:

<a hotrep="doc.overview.modules.path.0.0.1" href="https://cloud.tencent.com/document/product/1093/35681" title="产品优势">

产品优势

</a>
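For orientation, here is a minimal sketch of how BeautifulSoup can pull out the two attributes we care about from such a tag. The HTML string is just the example tag above; the title attribute will become the file name and the href attribute the download address:

from bs4 import BeautifulSoup

# The example a tag from the page, as a standalone snippet
html = '<a hotrep="doc.overview.modules.path.0.0.1" href="https://cloud.tencent.com/document/product/1093/35681" title="产品优势">产品优势</a>'

soup = BeautifulSoup(html, 'html.parser')
a_tag = soup.find('a')
print(a_tag.get('title'))  # "产品优势"  -> later used as the file name
print(a_tag.get('href'))   # document URL -> later used as the download address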

Enter the following prompt in deepseek:

You are a Python programming expert. You need to write a Python script that crawls a web page. The specific task is as follows:

Parse the web page: https://cloud.tencent.com/document/product/1093

Locate the div tag with class="rno-learning-path-wrap";

Then locate all a tags inside that div tag, extract the title attribute value as the file name and the href attribute value as the download address, download each page, and save it to the folder: F:\aivideo\腾讯云语音识别

Notes:

Print progress information to the screen at every step;

After downloading each page, pause for a random 3 to 6 seconds;

Set the request headers:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
Cache-Control: max-age=0
Priority: u=0, i
Referer: https://cloud.tencent.com/product/asr?from_column=20421&from=20421
Sec-Ch-Ua: " Not A;Brand";v="99", "Chromium";v="127", "Google Chrome";v="127"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
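Before running the generated script, it is worth a quick check that the target div actually appears in the HTML that requests receives, since documentation sites sometimes render navigation with JavaScript, which a plain HTTP fetch would miss. A small sketch of that sanity check, assuming the page is server-rendered:

import requests
from bs4 import BeautifulSoup

# Fetch the documentation index page and look for the target div
resp = requests.get("https://cloud.tencent.com/document/product/1093", timeout=10)
resp.encoding = 'utf-8'
soup = BeautifulSoup(resp.text, 'html.parser')
div = soup.find('div', class_='rno-learning-path-wrap')

print("div found:", div is not None)
if div is not None:
    print("a tags inside:", len(div.find_all('a')))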

The source code generated by deepseek:

import requests
from bs4 import BeautifulSoup
import os
import time
import random

# Set the request headers
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "Cache-Control": "max-age=0",
    "Priority": "u=0, i",
    "Referer": "https://cloud.tencent.com/product/asr?from_column=20421&from=20421",
    "Sec-Ch-Ua": '" Not A;Brand";v="99", "Chromium";v="127", "Google Chrome";v="127"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
}

# Create the output folder
output_folder = 'F:\\aivideo\\腾讯云语音识别'
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Request the page
url = "https://cloud.tencent.com/document/product/1093"
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the div tag with class="rno-learning-path-wrap"
div_tag = soup.find('div', class_='rno-learning-path-wrap')

# Locate all a tags inside that div tag
a_tags = div_tag.find_all('a')

# Extract the title attribute as the file name and the href attribute as the download address
for a_tag in a_tags:
    title = a_tag.get('title')
    href = a_tag.get('href')
    if title and href:
        # Print the link information to the screen
        print(f"Found link: {title} - {href}")
        # Download the page
        try:
            response = requests.get(href, headers=headers)
            response.encoding = 'utf-8'
            file_path = os.path.join(output_folder, f"{title}.html")
            with open(file_path, 'w', encoding='utf-8') as file:
                file.write(response.text)
            print(f"Downloaded {href} to {file_path}")
        except Exception as e:
            print(f"Failed to download {href}: {e}")
        # Pause for a random 3 to 6 seconds
        time.sleep(random.uniform(3, 6))
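One caveat about the generated code: it assumes every href is an absolute URL and every title is a legal Windows file name. If the page uses relative links (for example /document/product/1093/35681) or a title contains characters such as / or :, the download or the file write will fail. Below is a possible hardening sketch; the helper names resolve_url and safe_filename are mine, not part of the generated script:

import re
from urllib.parse import urljoin

BASE_URL = "https://cloud.tencent.com/document/product/1093"

def resolve_url(href):
    # Resolve a possibly relative href against the page URL
    return urljoin(BASE_URL, href)

def safe_filename(title):
    # Replace characters that Windows does not allow in file names
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# Hypothetical usage with values like those extracted in the loop above
print(resolve_url("/document/product/1093/35681"))  # -> absolute URL on cloud.tencent.com
print(safe_filename("hypothetical/title: demo"))    # -> "hypothetical_title_ demo"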
