Using Deepseek to Batch-Delete Chinese Paragraphs from Documents

AIGC部落

Published 2025-03-03 12:52:34


Included in the column: Dance with GenAI

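The documents (here, bilingual .srt subtitle files) contain many lines of Chinese text that all need to be deleted. Deepseek's coding ability can handle this in one go.

Enter the following prompt in Deepseek: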

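Write a Python script that completes the following task:

Open the folder: E:\Penguins Spy in the Huddle

Read the srt files in it,

and delete all Chinese paragraphs in them, including the Chinese characters, punctuation marks, and so on.

For example:

The original text is:

“00:00:03,531 --> 00:00:08,270
趣怪外表的背后,是奇异性情
behind their feisty charm lies an amazing character.”

After deletion it becomes:

“00:00:03,531 --> 00:00:08,270
behind their feisty charm lies an amazing character.”

When the deletion is finished, save the result as a new document encoded as UTF-8.

Note: every step must print information to the screen.

When a Chinese paragraph is detected, make sure that only the lines containing Chinese are deleted, rather than skipping all of the lines that follow.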
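As an aside to the prompt, the rule it asks for (judge each line on its own, and treat Chinese punctuation as Chinese) can be sketched on the prompt's own example. This is only an illustration under stated assumptions: the names `has_chinese` and `drop_chinese_lines` are hypothetical, and the character ranges are a guess at what "Chinese characters and punctuation" should cover (the basic CJK Unified Ideographs block, CJK Symbols and Punctuation, and Halfwidth and Fullwidth Forms).

```python
import re

# Assumed ranges: basic CJK ideographs, CJK symbols/punctuation,
# and halfwidth/fullwidth forms (which also covers fullwidth Latin letters).
CHINESE_RE = re.compile(r"[\u4e00-\u9fff\u3000-\u303f\uff00-\uffef]")


def has_chinese(line):
    """Return True if the line contains a character from the assumed ranges."""
    return CHINESE_RE.search(line) is not None


def drop_chinese_lines(text):
    """Drop each line that contains Chinese; every line is judged independently."""
    return "\n".join(line for line in text.splitlines() if not has_chinese(line))


sample = (
    "00:00:03,531 --> 00:00:08,270\n"
    "趣怪外表的背后,是奇异性情\n"
    "behind their feisty charm lies an amazing character."
)
print(drop_chinese_lines(sample))
```

Because each line is tested separately, the English line after the Chinese one survives, which is the behavior the last sentence of the prompt insists on.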

Source code:

import os
import re

import chardet  # third-party package: pip install chardet


def is_chinese(text):
    """Return True if the text contains at least one Chinese character."""
    return re.search(r'[\u4e00-\u9fff]', text) is not None


def detect_file_encoding(file_path):
    """Detect the encoding of a file."""
    with open(file_path, 'rb') as file:
        raw_data = file.read()
    result = chardet.detect(raw_data)
    # chardet reports None when it cannot guess; fall back to UTF-8
    return result['encoding'] or 'utf-8'


def remove_chinese_from_srt(file_path, output_path):
    """Delete the Chinese lines from an srt file."""
    # Detect the file encoding
    encoding = detect_file_encoding(file_path)
    print(f"Detected encoding for {file_path}: {encoding}")
    with open(file_path, 'r', encoding=encoding) as file:
        lines = file.readlines()
    new_lines = []
    for line in lines:
        # Keep timestamp lines and blank lines
        if re.match(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}', line) or line.strip() == "":
            new_lines.append(line)
        # Keep any other line only if it contains no Chinese
        elif not is_chinese(line):
            new_lines.append(line)
    with open(output_path, 'w', encoding='utf-8') as file:
        file.writelines(new_lines)


def process_folder(folder_path):
    """Process every srt file in the folder."""
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            # Skip outputs of an earlier run so they are not processed again
            if file.endswith('.srt') and not file.startswith('processed_'):
                file_path = os.path.join(root, file)
                output_path = os.path.join(root, f"processed_{file}")
                print(f"Processing file: {file_path}")
                remove_chinese_from_srt(file_path, output_path)
                print(f"Saved processed file to: {output_path}")


if __name__ == "__main__":
    folder_path = r"E:\Penguins Spy in the Huddle"
    print(f"Starting to process folder: {folder_path}")
    process_folder(folder_path)
    print("Processing completed.")
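The script depends on the third-party chardet package for encoding detection. Where installing it is not an option, a standard-library fallback could look like the sketch below. It is not part of Deepseek's answer: `decode_with_fallback` is a hypothetical helper, and the candidate list (UTF-8 with or without a BOM, then GB18030) is an assumption about which encodings such subtitle files are likely to use.

```python
def decode_with_fallback(raw, candidates=("utf-8-sig", "gb18030")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Usage with bytes that are not valid UTF-8 but are valid GB18030
text, used = decode_with_fallback("趣怪外表".encode("gb18030"))
print(used, text)
```

Trying the stricter encoding first matters here: UTF-8 rejects most byte sequences produced by other encodings, so a successful UTF-8 decode is a fairly strong signal, while GB18030 accepts far more input and is better kept as the second choice.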

With that, the documents are processed.
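One way to confirm the result is to scan a processed file for any line that still contains Chinese. The sketch below is an optional check, not part of the original script; `find_chinese_lines` is a hypothetical name, and it assumes the processed file was saved as UTF-8, as the prompt requested.

```python
import re
from pathlib import Path

# Same basic CJK range the script's is_chinese() uses
CHINESE_RE = re.compile(r"[\u4e00-\u9fff]")


def find_chinese_lines(path):
    """Return (line_number, text) pairs for lines that still contain Chinese."""
    text = Path(path).read_text(encoding="utf-8")
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if CHINESE_RE.search(line)
    ]
```

Calling `find_chinese_lines` on one of the `processed_*.srt` files should return an empty list if every Chinese line was removed.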

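This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2025-03-02. In case of infringement, contact cloudcommunity@tencent.com for removal.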


This article is shared from the Dance with GenAI WeChat official account.
