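I'm using the Python pdftotext package to extract the text of a PDF file. This works well, but I need the layout-preserving output that the command-line tool provides with "pdftotext -layout pdf_file.pdf". I'm not sure whether this is possible without explicitly invoking the command from my code.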
Actual code:
pdf = pdftotext.PDF(file)
plain_text = "\n\n".join(pdf)
Ideal code, with an option that better preserves the layout:
pdf = pdftotext.PDF(file, "-layout")
plain_text = "\n\n".join(pdf)
I'd like to avoid doing this in my Python program:
cmd = ['pdftotext', '-f', str(1), '-l', str(1), str(pdf_file), '-layout', '-']
Thanks!
Answered 2022-03-26 18:30:28
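import pdftotext
with open("file.pdf", "rb") as f:
    # physical=True requests physical-layout order (see the parameter description quoted from the source)
    pdf = pdftotext.PDF(f, physical=True)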
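To tie this back to the question's `"\n\n".join(pdf)` step, here is a minimal sketch that wraps both pieces in one function. The name `extract_text` and its `physical` argument are hypothetical helpers introduced for illustration, and whether `physical=True` reproduces the CLI's `-layout` output exactly is an assumption worth checking against your own PDFs.

```python
def extract_text(path, physical=True):
    """Return the text of every page of the PDF at `path`, separated by blank lines.

    `extract_text` is an illustrative helper, not part of the pdftotext package.
    """
    # Imported inside the function so the dependency is only needed when it runs.
    import pdftotext

    with open(path, "rb") as f:
        # physical=True asks for physical-layout ordering; the default
        # (physical=False, raw=False) lets the library reorder the text.
        pdf = pdftotext.PDF(f, physical=physical)
        # Iterating a pdftotext.PDF yields one string per page.
        return "\n\n".join(pdf)
```

Calling `extract_text("file.pdf")` would then take the place of building and running the `pdftotext ... -layout -` command by hand.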
In the library's source code I found these parameter descriptions:
" raw: If True, page text is output in the order it appears in the\n"
" content stream.\n"
" physical: If True, page text is output in the order it appears
https://stackoverflow.com/questions/67106225