
The best thing I can currently think of (within the list of "simple" tools) is <a href="http://www.ghostscript.com/releases/" rel="nofollow noreferrer">Ghostscript</a> (current version is v.8.71) and the PostScript utility program <code>ps2ascii.ps</code>. Ghostscript ships it in its <code>lib</code> subdirectory. Try this (on Windows):

<pre><code>gswin32c.exe ^
 -q ^
 -sFONTPATH=c:/windows/fonts ^
 -dNODISPLAY ^
 -dSAFER ^
 -dDELAYBIND ^
 -dWRITESYSTEMDICT ^
 -dCOMPLEX ^
 -f ps2ascii.ps ^
 -dFirstPage=3 ^
 -dLastPage=7 ^
 input.pdf ^
 -dQUIET ^
 -c quit
</code></pre>

This command processes pages 3-7 of <code>input.pdf</code>. Read the comments in the <code>ps2ascii.ps</code> file itself to see what the "weird" numbers and additional information mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts and page breaks...). To get "simple" text output, replace <code>-dCOMPLEX</code> with <code>-dSIMPLE</code>.
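If you would rather drive that command from a script than from a batch file, the argument list can be assembled programmatically. The following is only a minimal Python sketch: the function names `build_ps2ascii_command` and `run_ps2ascii` are hypothetical, the flags are copied from the command above, and it assumes that Ghostscript is on the `PATH` and can locate `ps2ascii.ps`.

```python
import subprocess
from typing import List, Optional


def build_ps2ascii_command(
    pdf_path: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    complex_output: bool = True,
    font_path: Optional[str] = None,
    gs_executable: str = "gswin32c.exe",  # use "gs" on Linux/macOS
) -> List[str]:
    """Assemble the Ghostscript + ps2ascii.ps argument list (hypothetical helper)."""
    args = [gs_executable, "-q"]
    if font_path is not None:
        args.append(f"-sFONTPATH={font_path}")
    args += [
        "-dNODISPLAY",
        "-dSAFER",
        "-dDELAYBIND",
        "-dWRITESYSTEMDICT",
        # -dCOMPLEX adds positions, fonts, etc.; -dSIMPLE gives plain text.
        "-dCOMPLEX" if complex_output else "-dSIMPLE",
        "-f",
        "ps2ascii.ps",
    ]
    if first_page is not None:
        args.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        args.append(f"-dLastPage={last_page}")
    args += [pdf_path, "-dQUIET", "-c", "quit"]
    return args


def run_ps2ascii(pdf_path: str, **options) -> str:
    """Run Ghostscript and return whatever it prints to standard output."""
    command = build_ps2ascii_command(pdf_path, **options)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, so the `^` line continuations are no longer needed.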


QuickPDF seems to be a reasonable library that should do what you want for a reasonable price.

<a href="http://www.quickpdflibrary.com/" rel="nofollow noreferrer">http://www.quickpdflibrary.com/</a> - They have a 30 day trial.


As of today I know: the best tool for text extraction from PDFs is <a href="http://www.pdflib.com/newsticker/single-news/article/pdflib-tet-4-product-family-available/" rel="noreferrer">TET, the text extraction toolkit</a>. TET is part of the PDFlib.com family of products.

PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".

TET's first incarnation is <a href="http://www.pdflib.com/products/tet/" rel="noreferrer">a library</a>. That one can probably do everything Budda006 wanted, including positional information about every element on the page. Oh, and it can also extract images. It recombines images which are fragmented into pieces.

pdflib.com also offers another incarnation of this technology, the <a href="http://www.pdflib.com/products/tet-plugin/" rel="noreferrer">TET plugin for Acrobat</a>. And the third incarnation is the <a href="http://www.pdflib.com/products/tet-pdf-ifilter/" rel="noreferrer">PDFlib TET iFilter</a>. This is a standalone tool for user desktops. Both these are free (as in beer) to use for private, non-commercial purposes.

And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) only spit out garbage.

I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good command line. The tool handled some of my "problematic" PDF test files to my full satisfaction.

From now on, this will be my recommendation for every sophisticated and challenging PDF text extraction requirement.

TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...

Give it a try.


<a href="https://bitmiracle.com/pdf-library/" rel="nofollow noreferrer">Docotic.Pdf library</a> may be used to <a href="https://bitmiracle.com/blog/extract-text-from-pdf-in-net" rel="nofollow noreferrer">extract text from PDF</a> files as plain text or as a collection of text chunks with coordinates for each chunk.
Docotic.Pdf can be used to <a href="https://bitmiracle.com/pdf-library/help/extract-images.aspx" rel="nofollow noreferrer">extract images from PDFs</a>, too.
Disclaimer: I work for Bit Miracle.


<a href="http://www.snowtide.com/" rel="nofollow noreferrer">PdfTextStream</a> (which you said you have been looking at) is now free for single-threaded applications. In my opinion its quality is much better than that of other libraries (especially for things like funky embedded fonts).
It is available in Java and C#.
Alternatively, you should have a look at <a href="http://pdfbox.apache.org/" rel="nofollow noreferrer">Apache PDFBox</a>, open source.


For image extraction, pdfimages is a free command line tool for Linux or Windows (win32):

<a href="http://www.cyberciti.biz/faq/easily-extract-images-from-pdf-file/" rel="nofollow">pdfimages: Extract and Save Images From A Portable Document Format ( PDF ) File</a>


For Python, there are <a href="https://pypi.python.org/pypi/pdfminer/" rel="noreferrer">PDFMiner</a> and <a href="https://github.com/mstamy2/PyPDF2" rel="noreferrer">PyPDF2</a>. For more information on these, see <a href="https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text">Python module for converting PDF to text</a>.

blocks|key|3238433|text|我知道这个话题已经很久了，但这种需求仍然存在。我阅读了许多文档、论坛和脚本，并构建了一个新的支持压缩和解压pdf的高级版本：|type|unstyled|depth|inlineStyleRanges|entityRanges|data|3238434|https://gist.github.com/smalot/6183152|offset|length|3238435|在某些情况下，出于安全原因，命令行是被禁止的。因此，原生PHP类可以满足许多需求。|3238436|希望这对每个人都有帮助|3238437|entityMap|0|LINK|mutability|MUTABLE|url^0|0|0|12|0|0|0|0^^$0|@$1|2|3|4|5|6|7|Q|8|@]|9|@]|A|$]]|$1|B|3|C|5|6|7|R|8|@]|9|@$D|S|E|T|1|U]]|A|$]]|$1|F|3|G|5|6|7|V|8|@]|9|@]|A|$]]|$1|H|3|I|5|6|7|W|8|@]|9|@]|A|$]]|$1|J|3|-4|5|6|7|X|8|@]|9|@]|A|$]]]|K|$L|$5|M|N|O|A|$P|C]]]]

I know that this topic is quite old, but the need is still alive. I read many documents, forum posts and scripts, and built a new, more advanced one that supports both compressed and uncompressed PDFs:

<a href="https://gist.github.com/smalot/6183152" rel="nofollow">https://gist.github.com/smalot/6183152</a>

In some cases, command-line execution is forbidden for security reasons.
So a native PHP class can fit many needs.

Hope it helps everyone.


Here is my suggestion.
If you want to extract text from PDF, you could import the pdf file into Google Docs, then export it to a more friendly format such as .html, .odf, .rtf, .txt, etc. All of this using the Drive API. It is free* and robust. Take a look at:

<a href="https://developers.google.com/drive/v2/reference/files/insert">https://developers.google.com/drive/v2/reference/files/insert</a> <a href="https://developers.google.com/drive/v2/reference/files/get">https://developers.google.com/drive/v2/reference/files/get</a>

Because it is a REST API, it is compatible with all programming languages. The links I posted above have working examples for many languages, including Java, .NET, Python, PHP, Ruby, and others.

I hope it helps.


One of the comments here used gs on Windows. I had some success with that on Linux/OSX too, with the following syntax:

<pre><code>gs \
 -q \
 -dNODISPLAY \
 -dSAFER \
 -dDELAYBIND \
 -dWRITESYSTEMDICT \
 -dSIMPLE \
 -f ps2ascii.ps \
 "${input}" \
 -dQUIET \
 -c quit
</code></pre>

I used <code>-dSIMPLE</code> instead of <code>-dCOMPLEX</code> because the latter outputs one character per line.


Apache PDFBox has this feature; the text extraction part is described in:

<a href="http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html" rel="nofollow">http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html</a>

For an example implementation, see
<a href="https://github.com/WolfgangFahl/pdfindexer" rel="nofollow">https://github.com/WolfgangFahl/pdfindexer</a>

The test case TestPdfIndexer.testExtracting shows how it works.


An efficient command-line tool that is open source, free of charge, and available on both Linux and Windows: it is simply named pdftotext. This tool is part of Xpdf.

<a href="http://en.wikipedia.org/wiki/Pdftotext" rel="noreferrer">http://en.wikipedia.org/wiki/Pdftotext</a>


I was given a 400-page PDF file with a table of data that I had to import (luckily with no images). <a href="http://ghostscript.com/download/">Ghostscript</a> worked for me:

<code>gswin64c -sDEVICE=txtwrite -o output.txt input.pdf</code>

The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, etc, and suck in all 30,000 records. <code>-dSIMPLE</code> and <code>-dCOMPLEX</code> made no difference in this case.
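The clean-up step described above can be quite small. Here is a minimal Python sketch, not the app I actually wrote: the helper names `clean_extracted_lines` and `split_record` are hypothetical, and the column splitting assumes that table columns in the text output are separated by runs of two or more spaces, which you should check against your own file.

```python
import re
from typing import Iterable, List


def clean_extracted_lines(lines: Iterable[str]) -> List[str]:
    """Drop blank lines and trim surrounding whitespace from each line."""
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if stripped:  # skip empty and whitespace-only lines
            cleaned.append(stripped)
    return cleaned


def split_record(line: str) -> List[str]:
    """Split one table row into fields.

    Assumption: columns are separated by two or more whitespace
    characters, while single spaces belong to the field itself.
    """
    return re.split(r"\s{2,}", line.strip())
```

To use it on the output of the command above, open `output.txt`, pass the file object to `clean_extracted_lines`, and call `split_record` on each remaining line. Page headers would still need to be filtered out separately, since how they look depends on the document.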


On my Macintosh systems, I find that "Adobe Reader" does a reasonably good job. I created an alias on my Desktop that points to "Adobe Reader.app", and all I do is drop a PDF file on the alias, which makes it the active document in Adobe Reader. Then, from the File menu, I choose "Save as Text...", give it a name and a location, click "Save", and I'm done.


Since the question is specifically about alternative tools for getting data from a PDF as XML, you may be interested in the commercial tool <a href="http://bytescout.com/products/developer/pdfextractorsdk/extract-pdf-to-xml" rel="nofollow noreferrer">"ByteScout PDF Extractor SDK"</a>, which does exactly this: it extracts text from a PDF as XML, along with positioning data (x, y) and font information:

Text in the source PDF: 


<pre class="lang-html prettyprint-override"><code>Products | Units | Price 
</code></pre>



Output XML:



<pre class="lang-html prettyprint-override"><code> &lt;row&gt;
 &lt;column&gt;
 &lt;text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11"&gt;Products&lt;/text&gt; 
 &lt;/column&gt;
 &lt;column&gt;
 &lt;text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11"&gt;Units&lt;/text&gt; 
 &lt;/column&gt;
 &lt;column&gt;
 &lt;text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11"&gt;Price&lt;/text&gt; 
 &lt;/column&gt;
&lt;/row&gt;
</code></pre>
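Once you have XML like this, picking out the text inside a pre-known region and re-emitting it as JSON takes only a few lines. The following is a rough Python sketch using just the standard library, not part of the SDK: the helper names `parse_text_elements` and `elements_in_region` are hypothetical, and it assumes that `x`/`y` give one corner of each text box with `width`/`height` extending in the positive direction.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict, List

SAMPLE_XML = """<row>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
  </column>
  <column>
    <text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
  </column>
</row>"""


def parse_text_elements(xml_string: str) -> List[Dict]:
    """Turn every <text> element into a dict with its text and bounding box."""
    root = ET.fromstring(xml_string)
    elements = []
    for node in root.iter("text"):
        elements.append({
            "text": (node.text or "").strip(),
            "x": float(node.get("x")),
            "y": float(node.get("y")),
            "width": float(node.get("width")),
            "height": float(node.get("height")),
            "font": node.get("fontName"),
        })
    return elements


def elements_in_region(elements: List[Dict], left: float, top: float,
                       right: float, bottom: float) -> List[Dict]:
    """Keep only elements whose box lies entirely inside the given rectangle.

    Assumption: the box spans [x, x + width] and [y, y + height].
    """
    return [
        e for e in elements
        if left <= e["x"] and e["x"] + e["width"] <= right
        and top <= e["y"] and e["y"] + e["height"] <= bottom
    ]


if __name__ == "__main__":
    found = elements_in_region(parse_text_elements(SAMPLE_XML), 400, 100, 460, 140)
    print(json.dumps(found, indent=2))
```

The same filtering idea applies to the positional output of any of the other tools mentioned here, as long as each text chunk comes with a bounding box.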



P.S.: It also breaks the text into a table-based structure.

Disclosure: I work for ByteScout

Can anyone recommend a library/API for extracting the text and images from a PDF?
We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the page. 

We would like that data to be output in <code>xml</code> or <code>json</code> format. We're currently looking at PdfTextStream, which seems pretty good, but we would like to hear other people's experiences and suggestions.

Are there alternatives (commercial or free) for extracting text from a PDF programmatically?

How to extract text from a PDF?
