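The subtitle file contains many paragraphs of Chinese text that all need to be deleted; DeepSeek's strong coding ability can take care of this in one pass.
Enter the following prompt in DeepSeek:
Write a Python script that performs the following task:
Open the folder: E:\Penguins Spy in the Huddle
Read the srt files in it,
and delete all Chinese paragraphs in them, including the Chinese characters, punctuation marks, and so on.
For example:
The original text is: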
“00:00:03,531 --> 00:00:08,270
趣怪外表的背后,是奇异性情
behind their feisty charm lies an amazing character.”
After deletion it becomes:
“00:00:03,531 --> 00:00:08,270
behind their feisty charm lies an amazing character.”
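After the deletion is complete, save the result as a new file encoded as UTF-8.
Note: print progress information to the screen at every step.
When a Chinese paragraph is detected, make sure that only the lines containing Chinese are deleted, rather than skipping all of the lines that follow.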
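The rule stated in the prompt can be checked on the example block before touching any files. The following is only a minimal sketch: the helper name `drop_chinese_lines` and the constant `CHINESE_CHAR` are hypothetical, and the character range U+4E00 to U+9FFF is an assumption about what counts as "Chinese" here.

```python
import re

# Assumption: a line counts as "Chinese" if it holds at least one
# character in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
CHINESE_CHAR = re.compile(r'[\u4e00-\u9fff]')


def drop_chinese_lines(block: str) -> str:
    """Hypothetical helper: drop every line that contains a Chinese character."""
    kept = [line for line in block.splitlines() if not CHINESE_CHAR.search(line)]
    return "\n".join(kept)


# The example block from the prompt.
sample = (
    "00:00:03,531 --> 00:00:08,270\n"
    "趣怪外表的背后,是奇异性情\n"
    "behind their feisty charm lies an amazing character."
)

print(drop_chinese_lines(sample))
```

Only the middle line is dropped; the timestamp and the English line that follow it are kept, which is the behaviour the last sentence of the prompt asks for.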
Source code:
import os
import re

import chardet  # third-party package: pip install chardet


def is_chinese(text):
    """Return True if the text contains at least one Chinese character."""
    return re.search(r'[\u4e00-\u9fff]', text) is not None


def detect_file_encoding(file_path):
    """Detect the encoding of a file from its raw bytes."""
    with open(file_path, 'rb') as file:
        raw_data = file.read()
    result = chardet.detect(raw_data)
    # chardet reports None when it cannot guess; fall back to UTF-8
    return result['encoding'] or 'utf-8'


def remove_chinese_from_srt(file_path, output_path):
    """Delete the Chinese lines from an srt file."""
    # Detect the file encoding
    encoding = detect_file_encoding(file_path)
    print(f"Detected encoding for {file_path}: {encoding}")
    with open(file_path, 'r', encoding=encoding) as file:
        lines = file.readlines()
    new_lines = []
    for line in lines:
        # Keep timestamp lines and blank lines
        if re.match(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}', line) or line.strip() == "":
            new_lines.append(line)
        # Keep any other line only if it contains no Chinese
        elif not is_chinese(line):
            new_lines.append(line)
    with open(output_path, 'w', encoding='utf-8') as file:
        file.writelines(new_lines)


def process_folder(folder_path):
    """Process every srt file in the folder."""
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            # Skip output from an earlier run so it is not processed again
            if file.endswith('.srt') and not file.startswith('processed_'):
                file_path = os.path.join(root, file)
                output_path = os.path.join(root, f"processed_{file}")
                print(f"Processing file: {file_path}")
                remove_chinese_from_srt(file_path, output_path)
                print(f"Saved processed file to: {output_path}")


if __name__ == "__main__":
    folder_path = r"E:\Penguins Spy in the Huddle"
    print(f"Starting to process folder: {folder_path}")
    process_folder(folder_path)
    print("Processing completed.")
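The prompt also asks for Chinese punctuation to be removed, but `is_chinese` matches only ideographs in U+4E00 to U+9FFF, so a line made up solely of fullwidth punctuation would be kept. One possible broader check is sketched below. The function name `contains_cjk` and the choice of Unicode blocks are assumptions, not part of the generated script, and the wider pattern would also drop an English line that happens to contain a fullwidth character.

```python
import re

# Assumed set of Unicode blocks to treat as "Chinese text or punctuation".
CJK_PATTERN = re.compile(
    r'['
    r'\u3400-\u4dbf'   # CJK Unified Ideographs Extension A
    r'\u4e00-\u9fff'   # CJK Unified Ideographs
    r'\u3001-\u303f'   # CJK Symbols and Punctuation, without the ideographic space U+3000
    r'\uff00-\uffef'   # Halfwidth and Fullwidth Forms
    r']'
)


def contains_cjk(text: str) -> bool:
    """Hypothetical replacement for is_chinese that also catches CJK punctuation."""
    return CJK_PATTERN.search(text) is not None


print(contains_cjk("。"))                              # ideographic full stop, U+3002
print(contains_cjk("00:00:03,531 --> 00:00:08,270"))  # plain ASCII timestamp
```

If this behaviour is wanted, `contains_cjk` could be called in place of `is_chinese` inside `remove_chinese_from_srt`; the rest of the script would stay the same.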
Document processing complete:
This article is shared from the Dance with GenAI WeChat official account.