Reading embedded code with Python, extracting URLs, and writing each URL's title to a new CSV file

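To read embedded code with Python, extract the URLs it contains, and write each URL together with its page title to a new CSV file, follow these steps: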

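  1. Import the required Python libraries (requests and beautifulsoup4 are third-party packages and must be installed first):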
import re
import csv
import requests
from bs4 import BeautifulSoup
  2. Define a function that extracts the URLs and fetches the title of each page:
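def extract_url_title(embedded_code):
    # Match http(s) URLs up to the first whitespace, quote, or angle bracket
    urls = re.findall(r'https?://[^\s"\'<>]+', embedded_code)
    titles = []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            # soup.title.string is None when <title> is empty or contains nested tags
            title = (soup.title.string or '').strip() if soup.title else ''
        except requests.RequestException:
            # Record an empty title when the page cannot be fetched
            title = ''
        titles.append(title)
    return urls, titles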
  3. Read the embedded-code file and call the function to extract the URLs and titles:
embedded_code_file = 'embedded_code.txt'
output_file = 'output.csv'

# UTF-8 is assumed here; adjust the encoding if the file uses another one
with open(embedded_code_file, 'r', encoding='utf-8') as file:
    embedded_code = file.read()

urls, titles = extract_url_title(embedded_code)
  4. Write the extracted URLs and titles to the CSV file:
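# newline='' lets the csv module control line endings; UTF-8 keeps non-ASCII titles intact
with open(output_file, 'w', newline='', encoding='utf-8') as file: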
    writer = csv.writer(file)
    writer.writerow(['URL', 'Title'])
    for url, title in zip(urls, titles):
        writer.writerow([url, title])
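
The CSV-writing step can be tried on its own, without any network access. The sketch below is only an illustration: `rows_to_csv` is a hypothetical helper that is not part of the original code, it writes to an in-memory `io.StringIO` buffer instead of `output.csv`, and the example.com URLs and titles are made-up sample data. It shows how `csv.writer` handles a title that contains a comma and a title that is empty.

```python
import csv
import io

def rows_to_csv(urls, titles):
    """Serialize URL/title pairs to CSV text with a header row."""
    buffer = io.StringIO(newline='')  # keep the writer's line endings untouched
    writer = csv.writer(buffer)
    writer.writerow(['URL', 'Title'])
    for url, title in zip(urls, titles):
        writer.writerow([url, title])
    return buffer.getvalue()

# Made-up sample data; example.com is a placeholder domain.
urls = ['https://example.com/a', 'https://example.com/b']
titles = ['Hello, world', '']

print(rows_to_csv(urls, titles))
```

The title with a comma is written as `"Hello, world"`, so it stays in a single column, and the empty title produces an empty second field. Titles therefore need no manual escaping before they are written.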

The complete Python code is as follows:

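import re
import csv
import requests
from bs4 import BeautifulSoup

def extract_url_title(embedded_code):
    # Match http(s) URLs up to the first whitespace, quote, or angle bracket
    urls = re.findall(r'https?://[^\s"\'<>]+', embedded_code)
    titles = []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            # soup.title.string is None when <title> is empty or contains nested tags
            title = (soup.title.string or '').strip() if soup.title else ''
        except requests.RequestException:
            # Record an empty title when the page cannot be fetched
            title = ''
        titles.append(title)
    return urls, titles

embedded_code_file = 'embedded_code.txt'
output_file = 'output.csv'

# UTF-8 is assumed here; adjust the encoding if the file uses another one
with open(embedded_code_file, 'r', encoding='utf-8') as file:
    embedded_code = file.read()

urls, titles = extract_url_title(embedded_code)

# newline='' lets the csv module control line endings
with open(output_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['URL', 'Title'])
    for url, title in zip(urls, titles):
        writer.writerow([url, title])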

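This code uses a regular expression to extract the URLs from the embedded code, sends an HTTP request to each URL with the requests library, and parses the response with BeautifulSoup to read the page's title. Finally, it writes each URL and its title as a row in the CSV file.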
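
The URL-extraction step can also be checked offline before any requests are sent. The following is a minimal sketch under stated assumptions: `find_urls` and `URL_PATTERN` are hypothetical names, the pattern is a simplified one that ends a URL at the first whitespace, quote, or angle bracket, the de-duplication is an addition that the code above does not perform, and the iframe snippet with example.com addresses is invented sample input.

```python
import re

# Simplified pattern (an assumption for this sketch): an http(s) URL runs
# until the first whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')

def find_urls(embedded_code):
    """Return the URLs found in embedded_code, in order, without duplicates."""
    seen = set()
    urls = []
    for url in URL_PATTERN.findall(embedded_code):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# Invented embed snippet; example.com is a placeholder domain.
sample = (
    '<iframe src="https://example.com/video/1"></iframe>\n'
    "<a href='https://example.com/page?id=2'>link</a>\n"
    '<iframe src="https://example.com/video/1"></iframe>'
)

print(find_urls(sample))
# ['https://example.com/video/1', 'https://example.com/page?id=2']
```

Removing duplicates before fetching avoids requesting the same page twice when an embed snippet repeats a link. Whether repeated URLs should appear once or several times in the CSV depends on the use case.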

