Q: Mass converting .doc files to .docx

Stack Overflow user

Asked 2022-11-21 16:11:43

1 answer · 37 views · 0 followers · 1 vote

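I have around 1.2 million .doc files (around 50 GB in size) that need to be converted to .docx. So far I have tried driving Word from Python through the win32com interface, but it is really slow (1-2 files per second). Is there a faster way to do this?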

Edit: the code I'm using so far:

import glob
import os

import win32com.client


def convert_doc_to_docx():
    src_dir = "sampledir"
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    # os.path.join adds the path separator that src_dir + "*.doc" was missing
    doc_files = glob.glob(os.path.join(src_dir, "*.doc"))
    total_files = len(doc_files)  # count only the files that will be converted
    for i, doc in enumerate(doc_files):
        in_file = os.path.abspath(doc)
        wb = word.Documents.Open(in_file)
        out_file = in_file + "x"
        wb.SaveAs2(out_file, FileFormat=16)  # 16 = wdFormatDocumentDefault (.docx)
        wb.Close()
        os.remove(in_file)
        print(f"Processed {i + 1} of {total_files} files")

    word.Quit()

python

ms-word

1 Answer

Stack Overflow user

Answered 2022-11-21 20:04:33

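As other commenters suggested, wordconv seems to be a good solution and is much faster than going through win32com. For about 1,700 files, the conversion took ~369 seconds, or roughly 0.21 seconds per object. This time depends heavily on your system hardware, since it involves a lot of reading and writing plus some processing power for the conversion itself. I basically maxed out my 16 GB of RAM and an old 6th-gen i7. Using an HDD would probably make it a lot slower. Even at 0.21 seconds per file, 1.2 million files would still take around 70 hours (if the speed is similar to what I get on my machine). Still, that is a big improvement over 1-2 seconds per object, which would take up to about 10 times as long.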
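The 70-hour figure can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it plugs in the measured run from this answer (1728 files in about 369.74 seconds) and assumes the per-file time stays constant across all 1.2 million files, which may well not hold on other hardware.

```python
# Numbers reported in this answer's test run
files_converted = 1728
elapsed_seconds = 369.7448267

# Size of the full job from the question
total_files = 1_200_000

seconds_per_file = elapsed_seconds / files_converted
# Assumption: throughput stays the same for the whole job
estimated_hours = seconds_per_file * total_files / 3600

print(f"{seconds_per_file:.3f} s per file")
print(f"about {estimated_hours:.0f} hours for {total_files:,} files")
```

With the unrounded per-file time this comes to roughly 71 hours, in line with the ~70 hours quoted above.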

I run the command C:\Program Files\Microsoft Office\root\Office16\Wordconv.exe -oice -nme srcfile dstfile in a for loop using subprocess.Popen().

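Although subprocess.run() is the recommended way to call a subprocess, I used subprocess.Popen() because it does not wait for the process to finish before continuing. There may be a way to do this with subprocess.run() as well, but I am not familiar enough with it yet. (Maybe someone can give feedback on this.)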

import os
import subprocess
from timeit import default_timer as timer


def convert_doc_to_docx():
    src_dir = r"c:\Users\myuser\test"
    out_dir = r"c:\Users\myuser\test\dst"
    os.makedirs(out_dir, exist_ok=True)
    # only pick up .doc files; os.path.isfile also skips the dst subfolder
    all_files = [
        name for name in os.listdir(src_dir)
        if name.lower().endswith(".doc") and os.path.isfile(os.path.join(src_dir, name))
    ]
    file_count = len(all_files)

    # change according to where "WordConv.exe" is located on your system
    path_to_wordconv = r"C:\Program Files\Microsoft Office\root\Office16\Wordconv.exe"

    print(f"Source dir file count: {file_count}")
    start = timer()
    processes = []
    for file in all_files:
        in_file_path = os.path.join(src_dir, file)
        out_file_path = os.path.join(out_dir, file + "x")

        # this will get process intensive: every file starts its own Wordconv process
        processes.append(
            subprocess.Popen([path_to_wordconv, "-oice", "-nme", in_file_path, out_file_path])
        )

    # wait for converters that are still running so the timing covers the conversions
    for process in processes:
        process.wait()
    end = timer()

    count_output_dir = len([name for name in os.listdir(out_dir) if os.path.isfile(os.path.join(out_dir, name))])
    elapsed_time = end - start
    # guard against an empty output folder
    time_object = elapsed_time / count_output_dir if count_output_dir else float("nan")

    print(f"Elapsed time: {elapsed_time} second")
    print(f"Time per object: {time_object} second")


convert_doc_to_docx()

Output

Source dir file count: 1728
Elapsed time: 369.7448267 second
Time per object: 0.21397270063657406 second
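
On the open question about subprocess.run(): one possible approach is to call subprocess.run() from a small thread pool, so that each call waits for its own converter while only a fixed number of converters run at once. What follows is a sketch, not a tested replacement for the code above. The names convert_all, convert_one, wordconv_command and the max_workers setting are hypothetical and chosen for illustration; the Wordconv path and the -oice -nme flags are simply copied from this answer.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Install location taken from the answer; adjust for your system.
WORDCONV = r"C:\Program Files\Microsoft Office\root\Office16\Wordconv.exe"


def wordconv_command(src, dst):
    """Build the Wordconv command line used in the answer."""
    return [WORDCONV, "-oice", "-nme", src, dst]


def convert_one(src, dst, build_command=wordconv_command):
    """Run one conversion, wait for it, and return (src, success)."""
    result = subprocess.run(build_command(src, dst), capture_output=True)
    return src, result.returncode == 0 and os.path.isfile(dst)


def convert_all(src_dir, out_dir, build_command=wordconv_command, max_workers=4):
    """Convert every .doc in src_dir with at most max_workers converters at once."""
    os.makedirs(out_dir, exist_ok=True)
    jobs = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src) and name.lower().endswith(".doc"):
            jobs.append((src, os.path.join(out_dir, name + "x")))

    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker thread blocks in subprocess.run(), so the pool size
        # caps how many converter processes exist at the same time.
        futures = [pool.submit(convert_one, s, d, build_command) for s, d in jobs]
        for future in futures:
            src, ok = future.result()
            if not ok:
                failed.append(src)
    return len(jobs) - len(failed), failed
```

A call such as convert_all(r"c:\Users\myuser\test", r"c:\Users\myuser\test\dst", max_workers=8) would return the number of successful conversions and a list of source files that failed. Capping the concurrency avoids launching one process per file all at once, which matters with 1.2 million inputs; a good max_workers value depends on the machine and would need to be found by experiment.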

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/74521779
