
AI Web Crawler: Batch Downloading All Links from a Web Page

AIGC部落 · Published 2024-07-10 · Column: Dance with GenAI

The web page to crawl looks like this and contains multiple links:

Inspecting the page source, each link is an a tag like this one:

<a hotrep="doc.overview.modules.path.0.0.1" href="https://cloud.tencent.com/document/product/1093/35681" title="产品优势">

产品优势

</a>
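For orientation, here is a minimal sketch of how BeautifulSoup can pull out the two attributes we care about from such a tag. The HTML string is just the example tag above; the title attribute will become the file name and the href attribute the download address:

from bs4 import BeautifulSoup

# The example a tag from the page, as a standalone snippet
html = '<a hotrep="doc.overview.modules.path.0.0.1" href="https://cloud.tencent.com/document/product/1093/35681" title="产品优势">产品优势</a>'

soup = BeautifulSoup(html, 'html.parser')
a_tag = soup.find('a')
print(a_tag.get('title'))  # "产品优势"  -> later used as the file name
print(a_tag.get('href'))   # document URL -> later used as the download address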

Enter the following prompt in deepseek:

You are a Python programming expert. You need to write a Python script that crawls a web page. The specific task is as follows:

Parse the web page: https://cloud.tencent.com/document/product/1093

Locate the div tag with class="rno-learning-path-wrap";

Then locate all a tags inside that div tag, extract the title attribute value as the file name and the href attribute value as the download address, download each page, and save it to the folder: F:\aivideo\腾讯云语音识别

Notes:

Print progress information to the screen at every step;

After downloading each page, pause for a random 3 to 6 seconds;

Set the request headers:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
Cache-Control: max-age=0
Priority: u=0, i
Referer: https://cloud.tencent.com/product/asr?from_column=20421&from=20421
Sec-Ch-Ua: " Not A;Brand";v="99", "Chromium";v="127", "Google Chrome";v="127"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
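Before running the generated script, it is worth a quick check that the target div actually appears in the HTML that requests receives, since documentation sites sometimes render navigation with JavaScript, which a plain HTTP fetch would miss. A small sketch of that sanity check, assuming the page is server-rendered:

import requests
from bs4 import BeautifulSoup

# Fetch the documentation index page and look for the target div
resp = requests.get("https://cloud.tencent.com/document/product/1093", timeout=10)
resp.encoding = 'utf-8'
soup = BeautifulSoup(resp.text, 'html.parser')
div = soup.find('div', class_='rno-learning-path-wrap')

print("div found:", div is not None)
if div is not None:
    print("a tags inside:", len(div.find_all('a')))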

The source code generated by deepseek:

import requests
from bs4 import BeautifulSoup
import os
import time
import random

# Set the request headers
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "Cache-Control": "max-age=0",
    "Priority": "u=0, i",
    "Referer": "https://cloud.tencent.com/product/asr?from_column=20421&from=20421",
    "Sec-Ch-Ua": '" Not A;Brand";v="99", "Chromium";v="127", "Google Chrome";v="127"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
}

# Create the output folder
output_folder = 'F:\\aivideo\\腾讯云语音识别'
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Request the page
url = "https://cloud.tencent.com/document/product/1093"
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the div tag with class="rno-learning-path-wrap"
div_tag = soup.find('div', class_='rno-learning-path-wrap')

# Locate all a tags inside that div tag
a_tags = div_tag.find_all('a')

# Extract the title attribute as the file name and the href attribute as the download address
for a_tag in a_tags:
    title = a_tag.get('title')
    href = a_tag.get('href')
    if title and href:
        # Print the link information to the screen
        print(f"Found link: {title} - {href}")
        # Download the page
        try:
            response = requests.get(href, headers=headers)
            response.encoding = 'utf-8'
            file_path = os.path.join(output_folder, f"{title}.html")
            with open(file_path, 'w', encoding='utf-8') as file:
                file.write(response.text)
            print(f"Downloaded {href} to {file_path}")
        except Exception as e:
            print(f"Failed to download {href}: {e}")
        # Pause for a random 3 to 6 seconds
        time.sleep(random.uniform(3, 6))
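One caveat about the generated code: it assumes every href is an absolute URL and every title is a legal Windows file name. If the page uses relative links (for example /document/product/1093/35681) or a title contains characters such as / or :, the download or the file write will fail. Below is a possible hardening sketch; the helper names resolve_url and safe_filename are mine, not part of the generated script:

import re
from urllib.parse import urljoin

BASE_URL = "https://cloud.tencent.com/document/product/1093"

def resolve_url(href):
    # Resolve a possibly relative href against the page URL
    return urljoin(BASE_URL, href)

def safe_filename(title):
    # Replace characters that Windows does not allow in file names
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# Hypothetical usage with values like those extracted in the loop above
print(resolve_url("/document/product/1093/35681"))  # -> absolute URL on cloud.tencent.com
print(safe_filename("hypothetical/title: demo"))    # -> "hypothetical_title_ demo"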
