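I followed the tutorial on this page to extract text from a PDF:
http://www.blog.pythonlibrary.org/2018/06/07/an-intro-to-pypdf2/
I can print the PDF's metadata, but not the content of the page. No error is raised, but the text of the PDF does not appear either.
What could be the problem?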
from PyPDF2 import PdfFileReader

def get_info(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
    #print(info)
    author = info.author
    creator = info.creator
    producer = info.producer
    subject = info.subject
    title = info.title
    print(author)
    print(creator)
    print(producer)
    print(subject)
    print(title)

def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # get the first page
        page = pdf.getPage(0)
        print(page)
        print('Page type: {}'.format(str(type(page))))
        text = page.extractText()
        print(text) #THIS PART SHOULD PRINT TEXT FROM PDF, BUT DOESNT WORK

if __name__ == '__main__':
    #URL PDF: https://oficinavirtual.ugr.es/apli/solicitudPAU/test.pdf
    path = 'test.pdf'
    get_info(path)
    print("\n"*2)
    text_extractor(path)
Posted on 2019-11-16 17:39:25
Although this is not a solution as such, you can simply install pdfminer3 with pip
and use the minimal reproducible example here.
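For reference, a pdfminer3-based extraction might look roughly like the sketch below. It is modelled on the usage pattern in the pdfminer3 README, not on the example the answer links to. The helper name extract_text_pdfminer is hypothetical, and whether it recovers text from this particular test.pdf is untested.

```python
import io


def extract_text_pdfminer(path):
    """Return the text of every page of the PDF at `path` (hypothetical helper)."""
    # Imported lazily so that a missing pdfminer3 install shows up as an
    # ImportError when the function is called, not when the module is loaded.
    from pdfminer3.converter import TextConverter
    from pdfminer3.layout import LAParams
    from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer3.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    buffer = io.StringIO()  # the converter writes the extracted text here
    converter = TextConverter(resource_manager, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, converter)
    try:
        with open(path, 'rb') as f:
            # Feed each page through the interpreter; text accumulates in buffer.
            for page in PDFPage.get_pages(f):
                interpreter.process_page(page)
        return buffer.getvalue()
    finally:
        converter.close()
        buffer.close()
```

Calling print(extract_text_pdfminer('test.pdf')) after running pip install pdfminer3 would then take the place of text_extractor(path) in the question's script.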
https://stackoverflow.com/questions/57879273