PDFs do not contain tabular data unless they include structured content. Some tools use heuristics to guess the data structure and reconstruct it. I wrote a blog article explaining the issues with PDF text extraction at <a href="http://www.jpedal.org/PDFblog/2009/04/pdf-text/" rel="nofollow noreferrer">http://www.jpedal.org/PDFblog/2009/04/pdf-text/</a>
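To give a feel for what such a heuristic might look like, here is a minimal sketch. It assumes the text has already been extracted as `(x, y, text)` fragments; `group_into_rows` and `y_tolerance` are hypothetical names made up for this illustration and do not come from any particular tool. Fragments whose y coordinates are close together are treated as one table row, and each row is then ordered left to right.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows of cell texts.

    Hypothetical helper: fragments whose y differs from the first
    fragment of the current row by at most y_tolerance share that row.
    """
    rows = []  # each entry: (y of the row's first fragment, [(x, text), ...])
    # PDF y coordinates grow upward, so sort descending to read top-down.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Order the cells of each row from left to right.
    return [[text for _, text in sorted(cells, key=lambda c: c[0])]
            for _, cells in rows]
```

Real tools have to cope with much more than this (merged cells, wrapped lines, rotated text), which is why the result depends so heavily on the individual PDF.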

<pre><code>$ pdftotext -layout thingwithtablesinit.pdf
</code></pre>

This produces a text file, thingwithtablesinit.txt, with the tables laid out correctly.
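Since the question asks about Python, the same command can be driven from a script. This is only a sketch: it assumes the `pdftotext` binary is installed and on the PATH, and the function names `build_pdftotext_command` and `pdf_to_layout_text` are invented for this example.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path=None):
    """Build the argument list for `pdftotext -layout` (hypothetical helper)."""
    cmd = ["pdftotext", "-layout", pdf_path]
    if txt_path is not None:
        # An explicit output file is optional; without it pdftotext
        # writes next to the PDF, using a .txt extension.
        cmd.append(txt_path)
    return cmd


def pdf_to_layout_text(pdf_path, txt_path=None):
    """Run pdftotext; raises CalledProcessError if the tool reports failure."""
    subprocess.check_call(build_pdftotext_command(pdf_path, txt_path))
```

Calling `pdf_to_layout_text("thingwithtablesinit.pdf")` should then behave like the shell command above.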

I had a similar problem and ended up using XPDF from <a href="http://www.foolabs.com/xpdf/" rel="nofollow">http://www.foolabs.com/xpdf/</a>. One of its utilities is pdftotext, but I guess it all comes down to how the PDF was produced.

As explained in other answers, extracting text from a PDF is not a straightforward task. However, there are Python libraries such as <a href="https://pypi.python.org/pypi/pdfminer/" rel="nofollow">pdfminer</a> (<a href="https://pypi.python.org/pypi/pdfminer3k" rel="nofollow">pdfminer3k</a> for Python 3) that are reasonably efficient.

The code snippet below shows a Python class that can be instantiated to extract text from a PDF. This will work in most cases.

(source - <a href="https://gist.github.com/vinovator/a46341c77273760aa2bb" rel="nofollow">https://gist.github.com/vinovator/a46341c77273760aa2bb</a>)

<pre><code># Python 2.7.6
# PdfAdapter.py

""" Reusable library to extract text from pdf file
Uses pdfminer library; For Python 3.x use pdfminer3k module
Below links have useful information on components of the program
https://euske.github.io/pdfminer/programming.html
http://denis.papathanasiou.org/posts/2010.08.04.post.html
"""

import logging

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
# from pdfminer.pdfdevice import PDFDevice
# To raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator

# Basic logging config. The logger gets an explicit name so that the
# module's own __name__ does not have to be overwritten.
log = logging.getLogger("pdfAdapter")
log.addHandler(logging.NullHandler())


class pdf_text_extractor:
    """ Modules overview:
     - PDFParser: fetches data from pdf file
     - PDFDocument: stores data parsed by PDFParser
     - PDFPageInterpreter: processes page contents from PDFDocument
     - PDFDevice: translates processed information from PDFPageInterpreter
        to whatever you need
     - PDFResourceManager: Stores shared resources such as fonts or images
        used by both PDFPageInterpreter and PDFDevice
     - LAParams: Parameters for the layout analyzer, which returns a LTPage
        object for each page in the PDF document
     - PDFPageAggregator: A device that collects the LT layout objects
        of each page
    """

    def __init__(self, pdf_file_path, password=""):
        """ Class initialization block.
        pdf_file_path - Full path of pdf including name
        password - If not passed, assumed to be empty
        """
        self.pdf_file_path = pdf_file_path
        self.password = password

    def getText(self):
        """ Algorithm:
        1) Open the PDF file
        2) Parse the file using PDFParser object
        3) Assign the parsed content to PDFDocument object
        4) Now the information in this PDFDocument object has to be
           processed. For this we need PDFPageInterpreter, PDFDevice and
           PDFResourceManager
        5) Finally process the file page by page
        """

        # Open and read the pdf file in binary mode
        with open(self.pdf_file_path, "rb") as fp:

            # Create parser object to parse the pdf content
            parser = PDFParser(fp)

            # Store the parsed content in PDFDocument object
            document = PDFDocument(parser, self.password)

            # Check if document is extractable, if not abort
            if not document.is_extractable:
                raise PDFTextExtractionNotAllowed

            # Create PDFResourceManager object that stores shared resources
            # such as fonts or images
            rsrcmgr = PDFResourceManager()

            # Set parameters for layout analysis
            laparams = LAParams()

            # A plain PDFDevice (PDFDevice(rsrcmgr)) would discard the
            # layout, so use a page aggregator device instead to get the
            # LT object elements of each page
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)

            # Create interpreter object to process content from PDFDocument
            # Interpreter needs to be connected to resource manager for
            # shared resources and device
            interpreter = PDFPageInterpreter(rsrcmgr, device)

            # Initialize the text
            extracted_text = ""

            # Ok now that we have everything to process a pdf document,
            # lets process it page by page
            for page in PDFPage.create_pages(document):
                # The interpreter processes the page stored in the
                # PDFDocument object
                interpreter.process_page(page)
                # The device returns the analyzed layout of the page
                layout = device.get_result()
                # Out of the many LT objects within layout, we are
                # interested in LTTextBox and LTTextLine
                for lt_obj in layout:
                    if (isinstance(lt_obj, LTTextBox) or
                            isinstance(lt_obj, LTTextLine)):
                        extracted_text += lt_obj.get_text()

        return extracted_text.encode("utf-8")
</code></pre>
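On the "specifying positions" part of the question: pdfminer layout objects carry a `bbox` attribute of the form `(x0, y0, x1, y1)`, so the inner loop above could be narrowed to a rectangle on the page. The following is a rough sketch rather than part of the original gist; `text_in_region` and its `region` argument are hypothetical names, and the coordinates are assumed to be in PDF points with the origin at the bottom-left of the page.

```python
def text_in_region(layout_objects, region):
    """Join the text of objects whose bbox lies entirely inside region.

    Hypothetical helper. `layout_objects` is any iterable of objects with
    a `bbox` tuple (x0, y0, x1, y1) and a `get_text()` method, which is
    what pdfminer text boxes and text lines provide.
    """
    rx0, ry0, rx1, ry1 = region
    parts = []
    for obj in layout_objects:
        # Skip figures, lines and other objects that carry no text.
        if not hasattr(obj, "get_text"):
            continue
        x0, y0, x1, y1 = obj.bbox
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            parts.append(obj.get_text())
    return "".join(parts)
```

Inside `getText`, something like `text_in_region(layout, (0, 600, 300, 800))` would then keep only the text boxes in that area of the page; the numbers are placeholders and need to be adapted to the actual layout.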

Note: there are other libraries, such as <a href="https://pypi.python.org/pypi/PyPDF2/1.26.0" rel="nofollow">PyPDF2</a>, which are good at transforming a PDF: merging PDF pages, splitting or cropping specific pages out of a PDF, etc.

I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available utilities to do this, it loses all formatting and all the tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying positions, etc.?

Thanks.

Extract text from PDF
