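I'm trying to write code that downloads a PDF from a URL. I found a way to do it, but it wasn't written for Python 3 and uses the file() function.
I tried replacing it with open() in the line fp = open(path, 'rb').
However, I get this error:
TypeError: expected str, bytes or os.PathLike object, not HTTPResponse
Any help would be appreciated. The code is below: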
    import bs4 as bs
    import urllib.request
    from urllib.request import urlopen
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.layout import LAParams
    from io import StringIO
    from io import open

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
        fp.close()
        device.close()
        stri = retstr.getvalue()
        retstr.close()
        return stri

    pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
    outputString = convert_pdf_to_txt(pdfFile)
    print(outputString)
    pdfFile.close()
Resources used:
http://zempirians.com/ebooks/Ryan%20Mitchell-Web%20Scraping%20with%20Python_%20Collecting%20Data%20from%20the%20Modern%20Web-O'Reilly%20Media%20(2015).pdf (page 101)
Extracting text from a PDF file using PDFMiner in python? (top answer)
Posted on 2018-02-18 12:43:00
Do this (you need to get the bytes out of the HTTP response object):
    pdfResponse = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
    outputString = convert_pdf_to_txt(pdfResponse.read())
See https://docs.python.org/3/library/http.client.html#httpresponse-objects
However, you also have to modify the convert_pdf_to_txt function so that it accepts raw data as input rather than a file path. That is, instead of
    def convert_pdf_to_txt(path):
        fp = open(path, 'rb')
        ...
        for page in PDFPage.get_pages(fp, ...)
you need:
    def convert_pdf_to_txt(rawbytes):
        import io
        fp = io.BytesIO(rawbytes)
        ...
        for page in PDFPage.get_pages(fp, ...)
io.BytesIO wraps the bytes in a file-like binary stream (https://docs.python.org/3/library/io.html#binary-i-o), so from then on it can stand in for an opened file.
I haven't used this PDF library before, but this should get you started.
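To make the bytes-then-BytesIO step concrete, here is a minimal standard-library sketch. The helper names fetch_bytes and as_binary_stream are hypothetical, and a data: URL stands in for the real PDF link so that the example runs without network access. It only shows that the wrapped bytes behave like a seekable binary file, which an HTTP response does not.

```python
import base64
import io
from urllib.request import urlopen


def fetch_bytes(url):
    """Download the resource at `url` and return its body as bytes."""
    # The context manager closes the response once the body has been read.
    with urlopen(url) as response:
        return response.read()


def as_binary_stream(raw_bytes):
    """Wrap raw bytes in an in-memory, seekable, file-like object."""
    return io.BytesIO(raw_bytes)


# Assumption: a data: URL replaces the real PDF URL so the sketch runs offline.
# The payload is a fake header, not a valid PDF document.
demo_payload = b"%PDF-1.4 demo"
demo_url = "data:application/pdf;base64," + base64.b64encode(demo_payload).decode("ascii")

raw = fetch_bytes(demo_url)
fp = as_binary_stream(raw)

print(type(raw).__name__)  # the body is plain bytes, not a path
print(fp.seekable())       # BytesIO supports seek(), like a file opened with 'rb'
print(fp.read(5))          # first bytes of the payload
fp.seek(0)                 # rewind before handing fp to a parser
```

With a real link, fetch_bytes(url) would be called with the PDF's URL and the resulting stream passed to PDFPage.get_pages in place of the opened file.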
Posted on 2018-02-18 15:55:31
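Rather than struggling with the outdated pdfminer release, I suggest using pdfminer.six, an updated fork of the pdfminer library that is compatible with Python 3.
    pip install pdfminer.six
You will have to edit some import statements, but for the most part the newer fork is a drop-in replacement.
So now, after reading the body of the HTTP response (as Adrian Tam suggested), you have the PDF as a bytes object. You can then call the conversion function with that object as its argument: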
    # BytesIO is needed in addition to the imports from the question
    from io import BytesIO, StringIO

    def convert_pdf_to_txt(pdf_obj):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        fp = BytesIO(pdf_obj)  # get a file-like binary object
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
        fp.close()
        device.close()
        stri = retstr.getvalue()
        retstr.close()
        print(stri)
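As an aside, recent pdfminer.six releases also ship a high-level helper, pdfminer.high_level.extract_text, which accepts a file path or a file-like object. The sketch below swaps that helper in for the manual interpreter loop above; it is not the approach from this answer. The function names looks_like_pdf and pdf_bytes_to_text are hypothetical, and the header check is a simple assumption that the file starts with %PDF- at offset 0.

```python
import io


def looks_like_pdf(raw_bytes):
    """Return True if the bytes start with the PDF magic header."""
    return raw_bytes.startswith(b"%PDF-")


def pdf_bytes_to_text(raw_bytes):
    """Extract text from an in-memory PDF via pdfminer.six's extract_text."""
    if not looks_like_pdf(raw_bytes):
        # A server may answer with an HTML error page instead of the PDF.
        raise ValueError("downloaded data does not look like a PDF")
    # Imported lazily so the header check works without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    # extract_text needs a seekable binary stream, hence the BytesIO wrapper.
    return extract_text(io.BytesIO(raw_bytes))
```

Given the bytes returned by urlopen(...).read(), the call would be pdf_bytes_to_text(those_bytes), which returns the text as a str instead of printing it.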
https://stackoverflow.com/questions/48848465