wget and PdfFileReader: cannot read a malformed PDF file

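wget is an open-source command-line tool for downloading files from the network. It supports HTTP, HTTPS, and FTP, and offers resumable, recursive, and background downloads. Given a URL, wget fetches the file and saves it locally; it does not check that the saved bytes form a valid PDF.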

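PdfFileReader is a Python class from the PyPDF2 library (newer PyPDF2 releases rename it to PdfReader). It opens a PDF file and parses it into an object from which you can read the page count, page contents, bookmarks, metadata, and similar information.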

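When a PDF file is malformed, PdfFileReader cannot read it. This usually happens because the file's structure is incorrect, the file is damaged, or it does not conform to the PDF specification. PdfFileReader then fails to parse the file and raises an error such as PdfReadError.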
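Before handing a downloaded file to a parser, it can help to look at the raw bytes. The sketch below is a heuristic only, not how PyPDF2 validates files: it assumes a well-formed PDF carries a `%PDF-` header near the start and `startxref` plus `%%EOF` markers near the end, and that a saved HTML error page is one likely reason a wget download fails to parse. The function name `describe_pdf_bytes` and its return labels are hypothetical.

```python
def describe_pdf_bytes(data: bytes) -> str:
    """Classify raw bytes using the markers a PDF parser looks for (heuristic)."""
    head = data[:1024]
    tail = data[-1024:]
    if b"%PDF-" not in head:
        # wget saves whatever the server sent; an HTML error page is one possibility.
        if b"<html" in head.lower():
            return "html"
        return "not-pdf"
    if b"startxref" not in tail or b"%%EOF" not in tail:
        # Header present but trailer markers missing: possibly a cut-off download.
        return "truncated"
    return "ok"


if __name__ == "__main__":
    print(describe_pdf_bytes(b"<html><body>404 Not Found</body></html>"))
    print(describe_pdf_bytes(b"%PDF-1.4\n1 0 obj\n"))
```

A result of "ok" only means the markers are present; the file can still be malformed in other ways.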

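When a file downloaded with wget cannot be read by PdfFileReader because it is malformed, the following approaches can help:

Check whether the PDF is really damaged: try opening the file in another PDF reader (such as Adobe Acrobat Reader) to confirm whether the file itself is the problem.
Repair the damaged PDF: try a PDF repair tool, such as PDF Repair Kit or PDFaid. These tools attempt to restore the file structure so that the file can be read normally.
Skip unreadable PDF files: add error handling to the program so that it skips a file it cannot read and continues with the remaining, valid PDF files.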
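The third approach, skipping unreadable files, can be sketched as a loop that records failures instead of stopping. This is a minimal illustration, not a PyPDF2 API: `parse_all` and the `parse` callable are hypothetical names, and with PyPDF2 `parse` might be a small function that opens the path in binary mode and returns the page count. Catching `Exception` broadly is an assumption that suits batch jobs; a narrower exception type may be preferable when the failure modes are known.

```python
from typing import Any, Callable, Iterable, List, Tuple


def parse_all(
    paths: Iterable[str], parse: Callable[[str], Any]
) -> Tuple[List[Tuple[str, Any]], List[Tuple[str, str]]]:
    """Apply parse() to every path; collect results and skipped files separately."""
    parsed: List[Tuple[str, Any]] = []
    skipped: List[Tuple[str, str]] = []
    for path in paths:
        try:
            parsed.append((path, parse(path)))
        except Exception as exc:  # malformed file: note the reason and move on
            skipped.append((path, f"{type(exc).__name__}: {exc}"))
    return parsed, skipped
```

Keeping the skipped list makes it possible to retry or repair those files later rather than losing track of them.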

These steps are general guidance; the right fix depends on the specific situation.


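PyPDF2 writes corrupted files

I'm having some problems with PyPDF2, specifically when splitting and rewriting files. On my Ubuntu server I open a file, split it into separate pages (at most 3 pages), and write them to the file system (then upload them to S3). No error is raised when the files are written, but I can't open them after downloading from S3 and, as you will see below, I can't open them on the server either. Any ideas? inputpdf = PdfFileReader(open(fi, 'rb')) print('breaking file into %s pages' % inputpdf.numPages) # 17 pages for i in range(m

Views: 0 | Asked: 2020-07-08 | Votes: 0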

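1 answer

PyPDF2 problem and decoding a PDF file from S3

I'm trying to fetch a PDF file stored in one of my AWS S3 buckets and get some of its metadata, such as the page count and file size. I successfully got the PDF from the bucket; calling print(obj) gives this: s3.Object(bucket_name='somebucketname', key='somefilename.pdf') With PyPDF2.PdfFileReader() I tried the raw file, the UTF-8 decoded file, and the ISO-8859-1 decoded file. The ISO-8859-1 decoded file is the only one that does not raise an exception, but when I try to pass it as the argument to PdfFileReader I get an error, and

Views: 2 | Asked: 2018-01-22 | Votes: 2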

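3 answers

A cheap way to add seek to a file-like object

PdfFileReader reads the content of a PDF file to create an object. I'm fetching the PDF from a CDN via urllib.urlopen(), which gives me a file-like object that has no seek. PdfFileReader, however, uses seek. What is a simple way to create a PdfFileReader object from a PDF downloaded via URL? For now, what can I do to avoid writing it to disk and reading it back through file()? Thanks in advance.

Views: 0 | Asked: 2010-04-16 | Votes: 2

Accepted answer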
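One common way to get a seekable object without touching the disk is to read the whole stream into an in-memory buffer. The sketch below assumes the payload fits in memory; `to_seekable` and `ForwardOnlyStream` are hypothetical names, the latter standing in for an HTTP response that only supports `read()`.

```python
import io


def to_seekable(stream) -> io.BytesIO:
    """Copy a read-only, forward-only byte stream into a seekable buffer."""
    return io.BytesIO(stream.read())


class ForwardOnlyStream:
    """Hypothetical stand-in for a network response: read() but no seek()."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    def read(self) -> bytes:
        data, self._payload = self._payload, b""
        return data


if __name__ == "__main__":
    buffer = to_seekable(ForwardOnlyStream(b"%PDF-1.4 demo"))
    buffer.seek(-4, io.SEEK_END)  # end-relative seeks work on BytesIO
    print(buffer.read())
```

The resulting `io.BytesIO` object can be passed anywhere a binary file object is expected.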

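3 answers

How to overwrite a file that Python is currently reading

I'm not sure of the best way to do this, but what I want is to read a PDF file, make various modifications, and save the modified PDF over the original. So far I can save the modified PDF to a separate file, but I want to replace the original rather than create a new file. Here is my current code: from pyPdf import PdfFileWriter, PdfFileReader output = PdfFileWriter() input = PdfFileReader(file('input.pdf', 'rb')) blank = PdfFileReader(file('C:\\BLANK.pdf

Views: 0 | Asked: 2010-05-01 | Votes: 3

Accepted answer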

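1 answer

ValueError: seek of closed file when using PyPDF2

I'm trying to extract text from a PDF file. The code is: from PyPDF2 import PdfFileReader with open('HTTP_Book.pdf', 'rb') as file: pdf = PdfFileReader(file) page = pdf.getPage(1) #print(dir(page)) print(page.extractText()) This gives me the error ValueError: seek of closed file If I just put the code under the with statement, it works fine. My question is: why is that? I have already stored the information in

Views: 181 | Asked: 2019-05-05 | Votes: 5

Accepted answer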
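The error itself comes from Python's file objects, not from PDF parsing: any seek on a handle after its `with` block has exited raises `ValueError`. The standard-library sketch below reproduces that and shows one workaround, under the assumption that the reader fetches data from the stream lazily and therefore needs it to stay usable; the function names are hypothetical.

```python
import io


def seek_after_close(path: str) -> str:
    """Keep a reference to the handle, leave the with block, then try to seek."""
    with open(path, "rb") as fh:
        handle = fh
    try:
        handle.seek(0)
    except ValueError as exc:
        return str(exc)
    return "no error"


def buffer_while_open(path: str) -> io.BytesIO:
    """Copy the bytes into memory while the file is open; the copy stays usable."""
    with open(path, "rb") as fh:
        return io.BytesIO(fh.read())
```

The alternative, as the question already notes, is to keep all work that touches the reader inside the `with` block.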

1 answer

TypeError: string indices must be integers in pdfreader

When I run this code import PyPDF2 as pdf bikeins = open('pdffileproj12.pdf','rb') read_bikeins = pdf.PdfFileReader(bikeins) I get this error: read_bikeins = pdf.PdfFileReader(bikeins) Traceback (most recent call last): File "", line 1, in read_bikeins = pdf.PdfFileReader(bikeins) File "C:\Users\Naveen Raj\Anaconda3\lib\site-pac

Views: 1 | Asked: 2018-02-27 | Votes: 0

3 answers

PdfFileReader: PdfReadError: Could not find xref table at specified location

I'm trying to read a PDF file in Python as follows: from PyPDF2 import PdfFileReader, PdfFileWriter test_reader = PdfFileReader(file("test.pdf", "rb")) The line above throws the error: PyPDF2.utils.PdfReadError: Could not find xref table at specified location Any help would be appreciated.

Views: 7 | Asked: 2015-12-05 | Votes: 6

Accepted answer

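1 answer

Cannot get a PDF's page count with Python 3.x: DependencyError: PyCryptodome is required for AES algorithm

I'm running data validation on files downloaded from URLs. One of the checks involves verifying the PDF's page count. Using the PyPDF2 package and PdfFileReader worked until I hit a PDF with 256-bit AES encryption that has a permissions password but no open password. I don't have access to any passwords, since these files come from manufacturer websites, so my conclusion is that for now I should just check whether the PDF is encrypted and, if so, skip it for the time being. But whether I try to retrieve the page count or check whether the PDF is encrypted, I get the following error: DependencyError: PyCryptodome is required for AES algorithm This error occurs at the if statement on line 6, even though pycryptodome is installed and the AES module is imported,

Views: 9 | Asked: 2022-08-29 | Votes: 0

Accepted answer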

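2 answers

Combining PDFs with Python: closing the PDF files afterwards

I use the code below to combine separate PDF files into a single PDF. It works well, but it leaves all the PDFs open. How do I close the PDF files involved (4 files: aaa, bbb, ccc, and abc) when the script ends? Something like f.close(), but I don't know where to insert it here. from pyPdf import PdfFileWriter, PdfFileReader def append_pdf(input,output): [output.addPage(input.getPage(page_num)) for page_num in range(input.numPages)] output = PdfFil

Views: 0 | Asked: 2014-07-21 | Votes: 1

Accepted answer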

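4 answers

How to close the pyPDF "PdfFileReader" class file handle

This should be a very simple question, but I couldn't find the answer in a Google search: how do I close the file handle opened by the pyPDF "PdfFileReader" class? Here is the code snippet: import os.path from pyPdf import PdfFileReader fname = 'my.pdf' input = PdfFileReader(file(fname, "rb")) os.rename(fname, 'my_renamed.pdf') This raises error 32. Thanks

Views: 0 | Asked: 2010-12-12 | Votes: 10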
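On Windows, error 32 means the file is still in use by a process, which fits a handle that was opened inline and never closed. A sketch of one way around it, using only the standard library: keep your own reference to the handle (here through a `with` block), copy the bytes into memory, and rename only after the block has closed the file. The function name `load_then_rename` is hypothetical, and the approach assumes the file is small enough to buffer.

```python
import io
import os


def load_then_rename(src: str, dst: str) -> io.BytesIO:
    """Read src fully, close it, then rename it; return the in-memory copy."""
    with open(src, "rb") as fh:
        buffered = io.BytesIO(fh.read())
    # The with block has closed the handle, so the rename is no longer blocked by it.
    os.rename(src, dst)
    return buffered
```

The returned buffer can be given to a PDF reader in place of the original file object.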

1 answer

pyPdf IOError exception on OS X

I'm trying to open a PDF (named kalimera.pdf) using PdfFileReader from the pyPdf module, with the following commands from pyPdf import PdfFileReader, PdfFileWriter document = PdfFileReader(open('kalimera.pdf', 'rb')) I get the following error: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Librar

Views: 1 | Asked: 2016-02-09 | Votes: 0

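1 answer

Opening and preprocessing text (300 PDFs) with Python

I'm supposed to preprocess some PDF files in a folder: remove punctuation, lowercase everything, remove stop words, and add some extra data from another CSV (as metadata). But I can't even open them. All my googling hasn't helped, because I don't understand the error message (other people's examples don't help because they have different data types). This is my code so far: import PyPDF2 import re for k in range(1,312): # open the pdf file object = PyPDF2.PdfFileReader("/Users/n_n/Desktop/Digitalization/reserve

Views: 7 | Asked: 2022-06-27 | Votes: 0

Accepted answer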

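3 answers

Reading all PDFs in a directory (image)

I attached an image to help show what I'm doing. I'm trying to write a program that adds a blank page to every PDF in a directory that has an odd number of pages. However, I can't seem to read all the PDFs in a directory. I have a script that works for a single PDF, but I have 1000 of these to do. Why can't I read all the PDF files in the user_input directory? The code is here: from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger import os user_input = input("Enter the path of your file: ") files = os.lis

Views: 3 | Asked: 2017-02-06 | Votes: 0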
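The question's code is cut off, so the actual cause is unknown. One frequent pitfall with directory listings, assuming the script passes the bare names returned by `os.listdir` straight to `open`, is that those names are relative to the listed directory rather than the current one. A sketch that sidesteps this with `pathlib`, which yields full paths (the function name `list_pdfs` is hypothetical):

```python
from pathlib import Path
from typing import List


def list_pdfs(directory: str) -> List[Path]:
    """Return full paths of the .pdf files directly inside directory."""
    return sorted(
        entry
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix.lower() == ".pdf"
    )
```

Each returned `Path` can be opened directly with `path.open("rb")`, regardless of the current working directory.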

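1 answer

How can I split a PDF into chapters and analyze the content of each chapter?

I want to classify and analyze the chapters and sections of a book in PDF format: count the words, and check which words occur how often and in which chapter. pip install PyPDF2 import PyPDF2 from PyPDF2 import PdfFileReader # Creating a pdf file object pdf = open('C:/Users/Dominik/Desktop/bsc/pdf1.pdf',"rb") # creating pdf reader object pdf_reader = PyPDF2.PdfFileReader(pdf) # checking

Views: 1 | Asked: 2019-08-10 | Votes: 1

Accepted answer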

2 answers

Converting a PDF page to an image with PyPDF2 and BytesIO

I have a function that gets a page from a PDF file via PyPDF2, and it should convert the first page to PNG (or JPG) using Pillow (PIL). from PyPDF2 import PdfFileWriter, PdfFileReader import os from PIL import Image import io # Open PDF Source # app_path = os.path.dirname(__file__) src_pdf= PdfFileReader(open(os.path.join(app_path, "../../../uploads/%s" % file

Views: 15 | Asked: 2017-03-11 | Votes: 3

Accepted answer

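2 answers

Python: getting the page count from a password-protected PDF

I've been trying to find a way to get the page count of a password-protected PDF with Python 3. So far I have tried the modules PyPDF2 and pdfminer2. Both fail because the file has not been decrypted. #!/usr/bin/python3 from PyPDF2 import PdfFileReader pdfFile = PdfFileReader(open("document.pdf", "rb")) print(pdfFile.numPages) This code produces an error: PyPDF2.utils.PdfReadError: File has not been decrypted Is there

Views: 4 | Asked: 2017-08-22 | Votes: 5

Accepted answer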

4 answers

pypdf: merging multiple PDF files into one PDF

If I have 1000+ PDF files that need to be merged into one PDF, input = PdfFileReader() output = PdfFileWriter() filename0000 ----- filename 1000 input = PdfFileReader(file(filename, "rb")) pageCount = input.getNumPages() for iPage in range(0, pageCount): output.addPage(input.getPage(iPage)) outputStream =

Views: 3 | Asked: 2013-06-14 | Votes: 33

Accepted answer

1 answer

Python: UnsupportedOperation: can't do nonzero end-relative seeks (PyPDF2)

Can you help me solve this problem? I can't read an Arabic PDF file, and I don't know what the problem is. Thanks import PyPDF2 def main(): with open("arabic_text.pdf", encoding='utf-8') as pdfFile: pdfRead = PyPDF2.PdfFileReader(pdfFile) output = PdfFileWriter() for m in range(pdfRead.getNumPages()): page = pdfRead.getPage(

Views: 0 | Asked: 2020-04-30 | Votes: 2
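The error message in the title is raised by Python's text-mode file objects: a file opened with an `encoding` argument is a text file, and text files reject seeks relative to the end, which a PDF parser needs in order to locate the trailer. The standard-library sketch below shows the difference between text and binary mode; `end_relative_seek` is a hypothetical helper, not part of PyPDF2.

```python
import io
import os


def end_relative_seek(path: str, mode: str, **open_kwargs) -> str:
    """Try to seek one byte back from the end of the file in the given mode."""
    with open(path, mode, **open_kwargs) as fh:
        try:
            fh.seek(-1, os.SEEK_END)
        except io.UnsupportedOperation as exc:
            return f"failed: {exc}"
        return "ok"
```

Calling it with mode `"rb"` succeeds, while mode `"r"` with `encoding="utf-8"` fails, which suggests opening the PDF with `open(path, "rb")` and no encoding.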

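1 answer

Counting words in PDF files while traversing a directory

Hello Stack Overflow community! I'm trying to build a Python program that walks a directory (and all subdirectories) and produces a cumulative word count over all .html, .txt, and .pdf files. Reading a .pdf file requires something extra (PdfFileReader) to parse the file. When parsing a .pdf file I get the following error and the program stops: AttributeError: 'PdfFileReader' object has no attribute 'startswith' When no .pdf files are parsed, the program completes successfully. Code #!/usr/bin/python import re import os import sys import os.pat

Views: 36 | Asked: 2018-03-06 | Votes: 1

Accepted answer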

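1 answer

PdfFileReader.getFields() returns {} in Django

I'm trying to read a PDF form with Django. The thing is, in another view of my views.py I did this successfully using PyPDF2 and its PdfFileReader.getFields() method. The problem now is that the read does not work: I have checked in Acrobat, and the file is still a form with actual fields, so I don't know what the problem could be. The relevant part of the code is attached: if request.method == "POST": form = Form(request.POST, request.FILES) # the form refer to a model called 'N

Views: 3 | Asked: 2022-03-22 | Votes: 0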

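1 answer

Positional argument required with PyPDF2

I'm trying to split a PDF file by page before converting it to text. I'm using this code to split it, but I got an error related to a positional argument. I think it should be the first page of the list, but I can't find a way to pass it into the code itself. Here is the code: from PyPDF2 import PdfFileReader, PdfFileWriter pdf_document = "5Dec2019.pdf" pdf = PdfFileReader(pdf_document) for page in range(pdf.getNumPages()): pdf_writer = PdfFileWriter cur

Views: 2 | Asked: 2020-02-16 | Votes: 1

Accepted answer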

1 answer

How to merge PDF files using a loop

from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger merger =PdfFileMerger() merger.append(PdfFileReader(open("seperated1 PDF file556.pdf", 'rb'))) merger.append(PdfFileReader(open("seperated1 PDF file557.pdf", 'rb'))) merger.append(PdfFileReader(open("

Views: 7 | Asked: 2022-10-31 | Votes: 0

Accepted answer

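1 answer

Writing the first line of each page of a PDF file to Excel

I'm new to Python and only have a script for searching for strings in PDFs. Now I want to build a script that gives me the results in a new CSV/xlsx file, containing the first line of each page of the given PDF files and their page numbers. For now I have the code below, which prints the whole page: from PyPDF2 import PdfFileReader pdf_document = "example.pdf" with open(pdf_document, "rb") as filehandle: pdf = PdfFileReader(filehandle) info = pdf.getDocumentInfo(

Views: 26 | Asked: 2020-11-06 | Votes: 0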

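2 answers

PyPDF2: splitting a PDF by page

I want to split a PDF file with PyPDF2. All the examples on the web are too hard or don't work, or I always get the error "AttributeError: 'PdfFileWriter' object has no attribute 'stream'". Can anyone help? I need a 3-page PDF split into three separate files. I started with this: pdfFileObj = open(r"D:\BPO\act.pdf", 'rb') pdfReader = PyPDF2.PdfFileReader(pdfFileObj) pdfWriter = PyPDF2.PdfFileWriter() pdfWriter.addPage(pd

Views: 8 | Asked: 2017-07-17 | Votes: 6

Accepted answer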

1 answer

Cannot merge PDFs in Python 3.6

I have the following snippet, which has been tested and works in Python 2.7; it merges multiple PDFs into one PDF. from PyPDF2 import PdfFileMerger, PdfFileReader #merge individual pdfs of each page into a single pdf merger = PdfFileMerger() for filename in pdf_list: merger.append(PdfFileReader(file("./" + pdf_location + "/" + filename,

Views: 1 | Asked: 2017-03-14 | Votes: 0

Accepted answer

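2 answers

How do I rotate a page with PyPDF2?

I'm editing a PDF file with PyPDF2. I managed to generate the PDF I want, but I still need to rotate some pages. I looked it up and found two methods, rotateClockwise and rotateCounterClockwise; although they say the parameter is an int, I can't get it to work. Python says: TypeError: unsupported operand type(s) for +: 'IndirectObject' and 'int' To reproduce this error: from PyPDF2 import PdfFileReader, PdfFileWriter reader = PdfFil

Views: 8 | Asked: 2017-03-06 | Votes: 8

Accepted answer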

1 answer

PdfReadWarning: PdfFileReader stream/file object is not in binary mode

I have many PDF pages that I want to merge into one file. My script is as follows: from PyPDF2 import PdfFileMerger,PdfFileReader filename_list=[] merger = PdfFileMerger() for i in range (0,66): filename='page'+str(i)+'.pdf' if not filename in filename_list: filename_list.append(filename) for filename in filename_list

Views: 6 | Asked: 2014-04-01 | Votes: 9

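2 answers

Flask - PyPDF2 - exporting a PDF file in memory

I'm trying to export a PDF file from a Flask app, but for some reason I can't seem to write it correctly. When I export to my local folder it works, but when I export through Flask I get a blank PDF. Any ideas? pdf = PdfFileWriter() p1 = PdfFileReader(open(os.path.join(STATICDIR,'page1.pdf'), "rb")) p2 = PdfFileReader(open(os.path.join(STATICDIR,'page2.pdf'), "rb

Views: 46 | Asked: 2021-07-23 | Votes: 0

Accepted answer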

1 answer

Adding page numbers from a PDF file in Python

The Python program below reads a PDF file and collects the unique words used in it. import PyPDF2 import re print('process started') pdfFile = open('pdf_file.pdf', 'rb') pdfFileReader = PyPDF2.PdfFileReader(pdfFile) pdfFilePageCount = pdfFileReader.numPages pdfPageText = "" for i in range(pdfFilePageCount):

Views: 1 | Asked: 2019-11-23 | Votes: 0

Accepted answer

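1 answer

Extracting text from an S3 bucket in Python

My AWS S3 bucket holds files in multiple formats, such as pdf, doc, rtf, odt, and png, and I need to extract text from them. I've managed to get the list of contents and their paths. Now, depending on the file type, I will use different libraries to extract text from the files. Since there may be thousands of files, I need to extract the text directly from S3 rather than downloading. filespath=['https://abc.s3.ap-south-1.amazonaws.com/DocumentOnPATest', 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf', 'https://

Views: 11 | Asked: 2021-01-19 | Votes: 0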

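1 answer

Creating an AWS function to split PDF files in an S3 bucket

I want to write an AWS Lambda function that: takes a PDF file from an S3 bucket -> splits the PDF file -> stores the split files in an S3 bucket. I'm using the PyPDF module, so I need to know how to use it in an AWS function. Code to split the PDF file: import os from PyPDF2 import PdfFileReader, PdfFileWriter pdf_file_path = 'filename.pdf' file_base_name = pdf_file_path.replace('.pdf','') output_folder_path = os.pat

Views: 1 | Asked: 2022-04-09 | Votes: 2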

1 answer

How do I edit a PDF file and replace its data?

I'm trying to rotate pages in a PDF file and then replace the old pages with the rotated pages in the same PDF file. I wrote the following code: #!/usr/bin/python import os from pyPdf import PdfFileReader, PdfFileWriter my_path = "/home/USER/Desktop/files/" input_file_name = os.path.join(my_path, "myfile.pdf") input_file = PdfFileReader(file(input_file_name, "rb"))

Views: 7 | Asked: 2015-02-23 | Votes: 1

Accepted answer

1 answer

Merging PDF files with Python and PyPDF2 throws a TypeError

I'm using Python 3.6.5 to merge PDFs, but I ran into a problem. The code below raises a 'TypeError: 'NumberObject' object is not subscriptable' error. What am I doing wrong? When I comment out the line with merger.append, it prints the file paths correctly. import webbrowser import os from PyPDF2 import PdfFileMerger, PdfFileReader path = 'C:/test/pdfs' merger = PdfFileMerger() for pd

Views: 0 | Asked: 2018-04-06 | Votes: 4

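1 answer

UnicodeEncodeError when extracting text in Python

I'm trying to extract content from a PDF file and store it in a text file. For page 1 of the PDF file (pdfreader.getPage(0)) my code runs fine, but when I do this for page 2, I get an error: UnicodeEncodeError: 'gbk' codec can't encode character '\u2122' in position 1831: illegal multibyte sequence I'm not sure what this means, as I'm new to Python. My code is: import PyPDF2 pdffileobj=open('meetingmin

Views: 108 | Asked: 2018-06-12 | Votes: 0

Accepted answer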

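2 answers

Python does not print the PDF text with PyPDF2

I tried to print a page of a PDF document: import PyPDF2 FILE_PATH = 'my.pdf' with open(FILE_PATH, mode='rb') as f: reader = PyPDF2.PdfFileReader(f) page = reader.getPage(0) # I tried also other pages e.g 1,2,.. print(page.extractText()) But I only get a lot of whitespace and no error message. Could it be that PyPDF2 doesn't support this PDF version (my.pdf)? This solved the problem (

Views: 35 | Asked: 2020-04-22 | Votes: 1

Accepted answer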

1 answer

Assertion error when reading a PDF file in Python with PyPDF2

When I try to read a PDF file, I get the following error. Code: from PyPDF2 import PdfFileReader import os os.chdir("Path to dir") pdf_document = 'sample.pdf' pdf = PdfFileReader(pdf_document,'rb') #Error here Error: Traceback (most recent call last): File "/home/krishna/PycharmProjects/sample/sample.py", l

Views: 45 | Asked: 2020-05-21 | Votes: 0

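1 answer

How do I read a PDF file in AWS S3 with Python and boto3?

I want to read a .pdf file in an S3 bucket, but the problem is that it returns formatted bytes, whereas this code works if the file is .csv or .txt. What is wrong with it for .pdf files? Code: import boto3 s3client = boto3.client('s3') fileobj = s3client.get_object( Bucket=BUCKET_NAME, Key='file.pdf' ) filedata = fileobj['Body'].read() contents = filedata print(contents) It returns: b

Views: 13 | Asked: 2022-01-19 | Votes: -2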

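3 answers

How to read PDF files one by one from a folder in Python

I'm reading PDF files and trying to extract keywords from them using NLP techniques. Right now the program accepts only one PDF at a time. I have a folder on my D drive called 'pdf_docs'. The folder contains many PDF documents. My goal is to read each PDF file from the folder one by one. How can I do this in Python? The code that runs successfully so far is shown below. import PyPDF2 file = open('abc.pdf','rb') fileReader = PyPDF2.PdfFileReader(file) count = 0 while count < 3:

Views: 3 | Asked: 2018-10-28 | Votes: 0

Accepted answer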

1 answer

Cannot get text data from a PDF file on the web

I'm struggling to get "text" data from a PDF file on the web, but I can't. Sample: Code using requests: import requests r = requests.get('https://otd.harvard.edu/upload/files/OTD_Startup_Guide.pdf') print(r.content) I can get the data, but not as "text". It is encrypted. Adobe InDesign 7.0\n /;metadata\n \nsaved xmp.iid:63935542733768118A6DE4CA3B065193\n 2011-08-16T10:14:49-04:00\n

Views: 2 | Asked: 2020-05-13 | Votes: 1

Accepted answer

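2 answers

Cannot open a PDF file with PyPDF2

I'm using Python 3.8.5. I'm trying to write a short script that concatenates PDF files, and to learn how, I'm trying to use PyPDF2. Unfortunately, I can't seem to create a PyPDF2.PdfFileReader instance without crashing. My code looks like this: import pathlib import PyPDF2 pdf_path = pathlib.Path('1.pdf') with pdf_path.open('rb') as pdf_file: reader = PyPDF2.PdfFileReader(pdf_file, strict=False) When I try to run it, I get the following traceback:

Views: 0 | Asked: 2020-09-26 | Votes: 6

Accepted answer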

1 answer

Cannot convert a PDF to text format

I get this error while parsing a PDF file with PyPDF2; I'm attaching the PDF along with the error. I have attached the PDF to be parsed please click to view Can anyone help? import PyPDF2 def convert(data): pdfName = data read_pdf = PyPDF2.PdfFileReader(pdfName) page = read_pdf.getPage(0) page_content = page.extractText() print(page_content)

Views: 32 | Asked: 2019-04-14 | Votes: 0

1 answer

How do I use PyPDF2 to extract text from a PDF uploaded to Google App Engine?

Is there a way to extract text and documentInfo from a PDF file uploaded through Google App Engine? I want to use PyPDF2; my code is: pdf_file = self.request.POST['file'].file pdf_reader = pypdf.PdfFileReader(pdf_file) This gives me an error: Traceback (most recent call last): .... File "/myrepo/myproj/main.py", line 154, in post pdf_text = pypdf.PdfFileReader(pd

Views: 4 | Asked: 2014-01-13 | Votes: 0

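1 answer

Requests: return a file object from a URL (as with open('', 'rb'))

I want to use requests to download a file directly into memory so that I can pass it straight to the PyPDF2 reader without writing it to disk, but I can't figure out how to pass it as a file object. Here is what I tried: import requests as req from PyPDF2 import PdfFileReader r_file = req.get('http://www.location.come/somefile.pdf') rs_file = req.get('http://www.location.come/somefile.pdf', stream=True) with ope

Views: 0 | Asked: 2015-05-05 | Votes: 15

Accepted answer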

1 answer

Python script for splitting multi-page PDFs works with some PDFs but not others

I'm using the script below on Windows to split multi-page PDFs. The script looks like this... from PyPDF2 import PdfFileWriter, PdfFileReader inputpdf = PdfFileReader(open("*pathToPDF**”, "rb")) for i in range(inputpdf.numPages): output = PdfFileWriter() output.addPage(inputpdf.getPage(i)) with open("document-page%s.pdf" %

Views: 3 | Asked: 2020-05-29 | Votes: 1

Accepted answer

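2 answers

Python: opening a PDF in binary mode with UTF-8

I'm trying to open a PDF file using PyPDF4. import PyPDF4 text = "" pdf_file = open(filename,mode='rb') pdfReader = PyPDF4.PdfFileReader(pdf_file) pdfObj = pdfReader.getPage(0) text = pageObj.extract(pdfObj) print(text) It works fine, except that the PDF content is German and special characters (umlauts) are encoded incorrectly (e.g. zun−chst instead of zunächst). I can't change the encoding in binary mode, but if I don't use binary

Views: 11 | Asked: 2020-10-21 | Votes: 1

Accepted answer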

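4 answers

Too many open() calls. How do I close all the files?

I'm trying to modify a lot of PDF files, so I have to open a lot of files. I use this approach many times, and as a result Python gives me a "too many open files" error. I'd like my code to be graceful. Many of the lines are too similar: readerbanner = PyPDF2.pdf.PdfFileReader(open('transafe.pdf', 'rb')) readertestpages = PyPDF2.pdf.PdfFileReader(open(os.path.join(Cache_path, cache_file_name), 'rb')) writeroutput.write(o

Views: 6 | Asked: 2015-06-11 | Votes: 1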
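When many files must be open at the same time, `contextlib.ExitStack` from the standard library can own all the handles and close them together when the block ends, even if an error occurs part-way. The sketch below illustrates the pattern with plain byte reads rather than PDF parsing; `read_sizes` is a hypothetical function name.

```python
from contextlib import ExitStack
from typing import List, Sequence, Tuple


def read_sizes(paths: Sequence[str]) -> Tuple[List[int], bool]:
    """Open every path at once, read each file, and close them all on exit."""
    with ExitStack() as stack:
        handles = [stack.enter_context(open(path, "rb")) for path in paths]
        sizes = [len(handle.read()) for handle in handles]
    # Leaving the with block closed every handle registered on the stack.
    return sizes, all(handle.closed for handle in handles)
```

If the files do not all need to be open together, processing them one at a time inside an ordinary `with open(...)` block keeps the number of open handles at one.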

1 answer

tabula-py cannot find the PDF file

I want to parse a PDF file, and after some reading I used the following code: from pdfminer.pdfparser import PDFParser from pdfminer.pdfdocument import PDFDocument import magic from pyPdf import PdfFileWriter, PdfFileReader import tabula import numpy as np filename = '/home/parser/test.pdf' magic.from_file(filename,mime=True) ifpdf = PdfFileReader(file(filenam

Views: 0 | Asked: 2018-08-02 | Votes: 2

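2 answers

Script to loop over files, match them by file name, and append

I have a directory containing many files named like this: 1234_part1.pdf 1234.pdf 5432_part1.pdf 5432.pdf 2323_part1.pdf 2323.pdf etc. I'm trying to merge the PDF files whose leading numeric part is the same. My code can do one at a time, but with more than 500 files in the directory I'm not sure how to loop. Here is what I have so far: from PyPDF2 import PdfFileMerger, PdfFileReader merger = PdfFileMerger() merger.append(PdfFileReader(file('c:/ex

Views: 0 | Asked: 2019-05-18 | Votes: 3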
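The matching step can be separated from the merging step: first group the file names by their leading number, then loop over the groups. The sketch below covers only the grouping and assumes names follow the two patterns shown in the question (`1234.pdf` and `1234_part1.pdf`); `group_by_number` is a hypothetical name, and the order within each group is plain string sort order, which may not be the order the merge should use.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Leading digits, an optional "_part<N>" suffix, then the .pdf extension.
NAME_PATTERN = re.compile(r"(\d+)(?:_part\d+)?\.pdf$")


def group_by_number(names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each leading number to the file names that share it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(names):
        match = NAME_PATTERN.match(name)
        if match:  # names that do not fit the pattern are ignored
            groups[match.group(1)].append(name)
    return dict(groups)
```

Each value in the returned dictionary is a list of names that could be appended to one merger in turn.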

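3 answers

Renaming a list of PDF files with a for loop

I'm trying to rename a list of PDF files using names extracted from the files with pyPdf. I tried using a for loop to rename the files, but I always get an error, code 32, saying the file is being used by another process. I'm using Python 2.7, and this is my code: import os, glob from pyPdf import PdfFileWriter, PdfFileReader # this function extracts the name of the file def getName(filepath): output = PdfFileWriter() input = PdfFileReader(file(file

Views: 3 | Asked: 2013-11-14 | Votes: 0

Accepted answer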

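1 answer

How do I convert a PDF into a .docx file using Python 3 and PyPDF2 (or any other way)?

I want to convert a .pdf into a .docx file. I've tried several methods, but this seems to be the best one (correct me if I'm wrong). I've seen this, but it didn't work for me; it is the same as the one below: import PyPDF2 path=r"C:\Users\name\Desktop\test maker tester\Computer Science\414838-2020-specimen-paper-1.pdf" text="" pdf_file = open(path, 'rb') text ="" read_pdf = PyPDF2.PdfFileReader(

Views: 1 | Asked: 2020-01-21 | Votes: 0

Accepted answer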