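I have a folder named SOURCE. It contains several subfolders: A, B, C, D, E, F, G, H. Each of these folders holds multiple PDF files. I want to read a single PDF from each one: one from A, one from B, one from C, one from D, and so on through H. In other words, for all 8 folders I want to read the first PDF file and extract its text. Extracting text from one PDF works fine, but how do I extract text from multiple PDFs? Below is the code that extracts the text from a single PDF.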
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.converter import TextConverter
import io
import glob as g
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)
with open('F:/technophile/Proj/SOURCE/A/abc.pdf', 'rb') as fh:
    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        page_interpreter.process_page(page)
    text = fake_file_handle.getvalue()
# close open handles
converter.close()
fake_file_handle.close()
print(text)

Posted on 2021-06-25 15:12:29
Maybe you can try something like this:
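# your code (the imports and resource_manager from the question)
import os

source = 'F:/technophile/Proj/SOURCE'
folders = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
allyourpdf = []
for fold in folders:
    # os.listdir returns names in arbitrary order, so sort to make "first" predictable
    pdfs = sorted(f for f in os.listdir(os.path.join(source, fold))
                  if f.lower().endswith('.pdf'))
    if not pdfs:
        continue  # this folder has no PDF
    # use a fresh buffer and converter per file so text from earlier PDFs does not accumulate
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(os.path.join(source, fold, pdfs[0]), 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)
    allyourpdf.append(fake_file_handle.getvalue())
    # close open handles
    converter.close()
    fake_file_handle.close()
# your code

I think it should work.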
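The folder-walking part can also be separated from the extraction part, which makes it easier to check on its own. The following is only a sketch under some assumptions: the helper names `first_pdf_per_folder` and `extract_all` are made up for illustration, "first" is taken to mean alphabetically first, and the `extract` argument stands for whatever function turns one PDF path into text.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def first_pdf_per_folder(source: Path, folders: Iterable[str]) -> Dict[str, Path]:
    """Map each subfolder name to its alphabetically first PDF.

    Subfolders that do not exist or contain no PDF are skipped.
    (Hypothetical helper, not part of pdfminer.)
    """
    result: Dict[str, Path] = {}
    for name in folders:
        folder = source / name
        if not folder.is_dir():
            continue
        # Match the extension case-insensitively and sort for a stable "first"
        pdfs = sorted(
            p for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
        if pdfs:
            result[name] = pdfs[0]
    return result


def extract_all(
    source: Path,
    folders: Iterable[str],
    extract: Callable[[Path], str],
) -> Dict[str, str]:
    """Apply `extract` to the first PDF of every subfolder.

    `extract` is supplied by the caller, e.g. a wrapper around the
    pdfminer code from the question.
    """
    found = first_pdf_per_folder(source, folders)
    return {name: extract(path) for name, path in found.items()}
```

If pdfminer.six is installed, its `pdfminer.high_level.extract_text` function could probably serve as the extractor, for example `extract_all(Path('F:/technophile/Proj/SOURCE'), 'ABCDEFGH', lambda p: extract_text(str(p)))`. The result is a dict keyed by folder name, so it is clear which text came from which folder.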
https://stackoverflow.com/questions/68126847