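I searched for my question but did not find an answer in the two existing questions on this topic.
Basically, I want to iterate over every page, because I only want to select the pages that contain specific text.
I have used pyPdf. It works for roughly 90% of PDFs, but sometimes it fails to extract the text from a page.
I used the following code: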
    import re
    import pyPdf

    extract = ""
    pdf = pyPdf.PdfFileReader(open('filename.pdf', "rb"))
    num_of_pages = pdf.getNumPages()
    for p in range(num_of_pages):
        # Read page p (the loop variable), not a fixed page number
        ex = pdf.getPage(p)
        ex = ex.extractText()
        if re.search(r"to be held (at|on)", ex.lower()):
            print 'yes'
            print ex, "\n"
            extract = extract + ex + "\n"
            continue

The code above works, but sometimes some pages are not extracted.
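I also tried pdfminer, but I could not work out how to iterate over a PDF page by page with it. pdfminer returns the text of the whole PDF.
I used the following code: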
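    from cStringIO import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = file(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()
        # Every page is written into the same buffer, so the result is the whole document
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
        text = retstr.getvalue()
        fp.close()
        device.close()
        retstr.close()
        return text

In the code above, the text of the PDF comes from this for loop: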
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()

Given that, how can I iterate over one page at a time?
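The pdfminer documentation is hard to follow, and there are also many different versions of the package.
So, is there another package that would solve my problem, or can this be done with pdfminer?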
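Answered 2016-01-06 11:50:06
I know answering your own question is frowned upon, but I think I may have figured out the answer to this one.
I don't think it is the best approach, but it still helped me.
I used a combination of pyPdf and pdfminer.
The code is as follows: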
    import re
    import pyPdf
    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO

    path = "filename.pdf"
    # pyPdf is only used to count the pages; pdfminer does the extraction
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    fp = file(path, 'rb')
    num_of_pages = pdf.getNumPages()
    extract = ""
    for i in range(num_of_pages):
        # Restrict pdfminer to the single page i
        inside = [i]
        pagenos = set(inside)
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        text = ""
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
            text = retstr.getvalue()
            text = text.decode("ascii", "replace")
            if re.search(r"to be held (at|on)", text.lower()):
                print text
                extract = extract + text + "\n"
                continue

There may be a better way, but for now I find this works quite well.
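For readers on Python 3, here is a minimal sketch of the same page-by-page idea using pdfminer alone. It assumes the pdfminer.six fork is installed (the code above targets Python 2), and the names `iter_page_texts` and `select_pages` are made up for illustration. One buffer is reused and emptied after every page, so each yielded string holds a single page.

```python
import io
import re

# Same pattern as in the question, made case-insensitive instead of lowercasing.
PATTERN = re.compile(r"to be held (at|on)", re.IGNORECASE)


def iter_page_texts(path):
    """Yield (page_index, text) for each page of the PDF at `path`."""
    # Imported here so select_pages() below stays usable without pdfminer.six.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    buf = io.StringIO()
    device = TextConverter(rsrcmgr, buf, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, "rb") as fp:
        try:
            for index, page in enumerate(PDFPage.get_pages(fp)):
                interpreter.process_page(page)
                yield index, buf.getvalue()
                # Empty the buffer before the next page. On Python 3,
                # truncate(0) does not move the stream position, so seek first.
                buf.seek(0)
                buf.truncate(0)
        finally:
            device.close()


def select_pages(pages, pattern=PATTERN):
    """Return the (page_index, text) pairs whose text matches `pattern`."""
    return [(index, text) for index, text in pages if pattern.search(text)]
```

It could be called as `select_pages(iter_page_texts("filename.pdf"))`. Because the page index comes from `enumerate`, pyPdf is not needed just to count pages.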
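Answered 2018-11-24 01:54:54
Because retstr retains every page written to it, you might consider changing the code to call retstr.truncate(0), which clears the string each time; otherwise each print outputs everything that has been read so far: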
    import re
    import pyPdf
    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO

    path = "filename.pdf"
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    fp = file(path, 'rb')
    num_of_pages = pdf.getNumPages()
    extract = ""
    for i in range(num_of_pages):
        inside = [i]
        pagenos = set(inside)
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        text = ""
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
            text = retstr.getvalue()
            # Clear the buffer so the next read returns only new text
            retstr.truncate(0)
            text = text.decode("ascii", "replace")
            if re.search(r"to be held (at|on)", text.lower()):
                print text
                extract = extract + text + "\n"
                continue

Answered 2020-11-24 08:22:24
You can refer to the following link to extract text from a PDF page by page.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages("test.pdf"):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                print(element.get_text())

https://stackoverflow.com/questions/34591770
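To connect this back to the original goal of keeping only the pages that contain specific text, the `extract_pages` loop can be wrapped so that it yields one string per page. This is a rough sketch that assumes pdfminer.six on Python 3; `iter_layout_page_texts` and `collect_matches` are hypothetical helper names, not part of the library.

```python
import re

# Pattern taken from the question; IGNORECASE replaces the .lower() call.
PATTERN = re.compile(r"to be held (at|on)", re.IGNORECASE)


def iter_layout_page_texts(path):
    """Yield (page_index, text) by joining the text containers of each page."""
    # Imported here so collect_matches() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for index, page_layout in enumerate(extract_pages(path)):
        parts = [
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        yield index, "".join(parts)


def collect_matches(pages, pattern=PATTERN):
    """Concatenate the text of matching pages, one trailing newline per page."""
    return "".join(text + "\n" for _, text in pages if pattern.search(text))
```

For example, `extract = collect_matches(iter_layout_page_texts("test.pdf"))` would build the same kind of `extract` string as the code in the question.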