A Step-by-Step Guide to Splitting and Merging PDFs with Python

Source: 企鹅号 (Penguin Account) - FlyAI


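At work you may need to process PDF files. PyPDF2 provides reading, splitting, merging, file conversion, and other operations that make PDFs easy to handle. In this article, we will learn a simple way to split and merge PDF files.

Documentation: https://pythonhosted.org/PyPDF2/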

Getting Started

PyPDF2 is not part of the Python standard library, so you need to install it yourself. The best way to do that is with pip.

pip install pypdf2

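Let's get started.

Splitting PDFs

PyPDF2 can split a single PDF into multiple PDFs. You just need to tell it how many pages you want. In this example, we will download a W9 form from the IRS, loop over all six of its pages, and split each page off into its own standalone PDF.

Let's see how: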

    # pdf_splitter.py
    import os
    from PyPDF2 import PdfFileReader, PdfFileWriter


    def pdf_splitter(path):
        # Base name of the input file without its extension
        fname = os.path.splitext(os.path.basename(path))[0]

        pdf = PdfFileReader(path)
        for page in range(pdf.getNumPages()):
            # One writer per page, so each page becomes its own PDF
            pdf_writer = PdfFileWriter()
            pdf_writer.addPage(pdf.getPage(page))

            # Page indices are zero-based, so add 1 for the filename
            output_filename = '{}_page_{}.pdf'.format(fname, page + 1)

            with open(output_filename, 'wb') as out:
                pdf_writer.write(out)

            print('Created: {}'.format(output_filename))


    if __name__ == '__main__':
        path = 'w9.pdf'
        pdf_splitter(path)

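For this example, we need to import PdfFileReader and PdfFileWriter. Then we create a fun little function called pdf_splitter, which accepts the path of the input PDF. The first line of the function grabs the name of the input file, minus the extension. Next we open the PDF and create a reader object. Then we loop over all the pages using the reader object's getNumPages method.

Inside the for loop, we create an instance of PdfFileWriter. We then add a page to our writer object with its addPage method. This method accepts a page object, so to get one we call the reader object's getPage method. Now we have added one page to our writer object. The next step is to create a unique filename, which we do by combining the original filename, the word "page", and the page number + 1. We add one because PyPDF2's page numbers are zero-based, so page 0 is actually page 1.

Finally, we open the new filename in write-binary mode and use the PDF writer object's write method to write the object's contents to disk.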
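The filename logic described above can be tried on its own, without PyPDF2 or a real PDF. The following is a minimal sketch: page_filenames is a hypothetical helper (it is not part of the original script), and the page count is passed in by hand rather than read from a reader object.

```python
import os


def page_filenames(path, num_pages):
    """Return the output filenames that pdf_splitter would create.

    Hypothetical helper for illustration only. num_pages is supplied by
    the caller here; in the real script it comes from getNumPages().
    """
    # 'forms/w9.pdf' -> 'w9.pdf' -> ('w9', '.pdf') -> 'w9'
    fname = os.path.splitext(os.path.basename(path))[0]
    # Page indices start at 0, so add 1 to get a human-friendly number
    return ['{}_page_{}.pdf'.format(fname, page + 1)
            for page in range(num_pages)]


if __name__ == '__main__':
    for name in page_filenames('forms/w9.pdf', 3):
        print(name)
```

Because only the base name is kept, the directory part of the input path is dropped, so the split files end up in the current working directory rather than next to the input file.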

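Merging Multiple PDFs

Now that we have a bunch of PDFs, how do we merge them back together?

When the original PyPDF was released, the only way to merge multiple PDFs was like this: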

    # pdf_merger.py
    import glob
    from PyPDF2 import PdfFileWriter, PdfFileReader


    def merger(output_path, input_paths):
        pdf_writer = PdfFileWriter()

        for path in input_paths:
            pdf_reader = PdfFileReader(path)
            # Copy every page of this input into the single writer
            for page in range(pdf_reader.getNumPages()):
                pdf_writer.addPage(pdf_reader.getPage(page))

        with open(output_path, 'wb') as fh:
            pdf_writer.write(fh)


    if __name__ == '__main__':
        paths = glob.glob('w9_*.pdf')
        paths.sort()
        merger('pdf_merger.pdf', paths)

For each PDF path, we create a PdfFileReader object, then loop over its pages and add each page to our writer object. Then we write the writer object's contents out to disk.

PyPDF2 makes merging a bit simpler through its PdfFileMerger class:

    # pdf_merger2.py
    import glob
    from PyPDF2 import PdfFileMerger


    def merger(output_path, input_paths):
        pdf_merger = PdfFileMerger()

        for path in input_paths:
            # append() adds the whole document in one call
            pdf_merger.append(path)

        with open(output_path, 'wb') as fileobj:
            pdf_merger.write(fileobj)


    if __name__ == '__main__':
        paths = glob.glob('w9_*.pdf')
        paths.sort()
        merger('pdf_merger2.pdf', paths)

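Here we just need to create the PdfFileMerger object, then loop over the PDF paths and append each one to our merger object. PyPDF2 automatically appends the entire document, so you don't need to loop over all the pages of each document yourself. Then we write it out to disk.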
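One detail worth checking in both merge scripts is the call to paths.sort(), which orders the filenames as plain strings. That works for the six-page W9, but once a document has ten or more pages, 'w9_page_10.pdf' sorts before 'w9_page_2.pdf' and the merged pages come out in the wrong order. The sketch below shows one possible fix using a "natural" sort key; natural_key is a hypothetical helper name and is not part of PyPDF2.

```python
import re


def natural_key(path):
    """Sort key that compares runs of digits as integers.

    Hypothetical helper for illustration only.
    """
    # re.split with a capturing group keeps the digit runs, so the parts
    # alternate text, digits, text, ...:
    # 'w9_page_10.pdf' -> ['w', '9', '_page_', '10', '.pdf']
    parts = re.split(r'(\d+)', path)
    # Odd positions are always the digit runs; convert those to int
    return [int(part) if index % 2 else part
            for index, part in enumerate(parts)]


if __name__ == '__main__':
    paths = ['w9_page_10.pdf', 'w9_page_2.pdf', 'w9_page_1.pdf']
    print(sorted(paths))                   # plain string order
    print(sorted(paths, key=natural_key))  # page-number order
```

In the scripts above, this would mean replacing paths.sort() with paths.sort(key=natural_key).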

The PdfFileMerger class also has a merge method that you can use. Its definition looks like this:

    def merge(self, position, fileobj, bookmark=None, pages=None, import_bookmarks=True):
        """
        Merges the pages from the given file into the output file at the
        specified page number.

        :param int position: The *page number* to insert this file. File will
            be inserted after the given number.

        :param fileobj: A File Object or an object that supports the standard read
            and seek methods similar to a File Object. Could also be a
            string representing a path to a PDF file.

        :param str bookmark: Optionally, you may specify a bookmark to be applied at
            the beginning of the included file by supplying the text of the bookmark.

        :param pages: can be a :ref:`Page Range` or a ``(start, stop[, step])`` tuple
            to merge only the specified range of pages from the source
            document into the output document.

        :param bool import_bookmarks: You may prevent the source document's bookmarks
            from being imported by specifying this as ``False``.
        """

Give it a try and see what you can do.
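As one possible starting point, here is a rough sketch that uses the pages parameter of merge to insert only part of a second PDF into a first one. It assumes the same PyPDF2 version and PdfFileMerger API used throughout this article. The helper names page_range and insert_pages are hypothetical, as are the filenames in the usage note. The (start, stop) tuple is treated like a Python range: zero-based start, exclusive stop.

```python
def page_range(first_page, last_page):
    """Convert 1-based, inclusive page numbers to a (start, stop) tuple.

    Hypothetical helper. The result follows range() conventions:
    zero-based start, exclusive stop.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError('expected 1 <= first_page <= last_page')
    return (first_page - 1, last_page)


def insert_pages(output_path, base_path, extra_path, position,
                 first_page, last_page):
    """Insert pages first_page..last_page of extra_path into base_path.

    Hypothetical helper; the pages are inserted after page `position`
    of the base document.
    """
    # Imported here so page_range can be tried without PyPDF2 installed
    from PyPDF2 import PdfFileMerger

    pdf_merger = PdfFileMerger()
    pdf_merger.append(base_path)
    pdf_merger.merge(position, extra_path,
                     pages=page_range(first_page, last_page))

    with open(output_path, 'wb') as fileobj:
        pdf_merger.write(fileobj)


if __name__ == '__main__':
    print(page_range(2, 3))
```

For example, insert_pages('combined.pdf', 'w9_page_1.pdf', 'w9.pdf', 1, 2, 3) would be expected to produce a three-page file: the single page from w9_page_1.pdf followed by pages 2 and 3 of w9.pdf.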

Original article: http://www.blog.pythonlibrary.org/2018/04/11/splitting-and-merging-pdfs-with-python/#more-7268


Published: 2018-04-22 10:03:34
Original link: http://kuaibao.qq.com/s/20180422A0BJLW00?refer=cp_1026