Try this:

<a href="http://www.codeproject.com/KB/cs/PDFToText.aspx" rel="nofollow noreferrer">http://www.codeproject.com/KB/cs/PDFToText.aspx</a>

Bye

<a href="http://en.wikipedia.org/wiki/Pdftotext" rel="nofollow noreferrer"><code>pdftotext</code></a> seems to do the trick quite nicely.

<pre><code>pdftotext file.pdf [textfile.txt]
</code></pre>
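Since the question asks about a recurring import, the command can be wrapped in a script. The following is only a minimal sketch, assuming `pdftotext` (from Xpdf or Poppler) is installed and on the PATH; the helper names `build_command` and `extract_text` are made up for illustration. Passing `-` as the output file sends the text to stdout, and `-enc UTF-8` requests UTF-8 output.

```python
import subprocess


def build_command(pdf_path, extra_args=()):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", *extra_args, str(pdf_path), "-"]


def extract_text(pdf_path, extra_args=()):
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the pdftotext binary is not installed.
    """
    result = subprocess.run(
        build_command(pdf_path, extra_args),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A call such as `extract_text("file.pdf")` would then return the whole document as one string, which a scheduled job can parse further.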

Edit: I'm not sure how you would like to retain information about the tables. The best looking output (to my human eye, at least) is produced by

<pre><code>pdftotext -layout file.pdf [textfile.txt]
</code></pre>

This maintains the original layout of the document as closely as possible. In particular, the tables still look pretty good in the text output. The default is to interpret the columns of a table as columns of text, which gives terrible results. Another option that doesn't look as good to me, but might still be useful, is the <code>-raw</code> option.
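Once the layout is preserved, the table cells end up separated by runs of spaces, so a script can split them apart. Here is a rough sketch of that idea, assuming that two or more consecutive spaces always mark a column boundary (that threshold is my assumption, not something `pdftotext` guarantees); `split_layout_rows` and the sample text are hypothetical.

```python
import re

# Two or more spaces are treated as a column gap (an assumption about the layout).
COLUMN_GAP = re.compile(r" {2,}")


def split_layout_rows(text):
    """Split layout-preserved text into rows of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(COLUMN_GAP.split(line))
    return rows


sample = (
    "Item          Qty    Price\n"
    "Blue widget     2     9.99\n"
    "\n"
    "Red widget     10    19.50\n"
)
for row in split_layout_rows(sample):
    print(row)
```

Note that an empty cell simply disappears with this approach, so the remaining cells shift left. If the tables can contain blanks, slicing each line at fixed character offsets taken from the header row is likely to be more reliable.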

I can't provide a solution but only offer general advice. My advice to you is to open a PDF document in Notepad or another Plain Text editor and study the formatting codes. They're very easy to understand. For example, //par is a Paragraph and //tab is a Tab. Once you know the formatting codes for table layouts, it'll be very easy for you to come up with your own solution to extract anything from a PDF document.

There are also PDFBox and JPedal for Java. Tables do not exist in the PDF file format, so any software will be 'guessing' them.
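To give an idea of what that 'guessing' looks like, here is a small sketch of one common heuristic: text fragments whose vertical positions are nearly equal are treated as one table row, and are then ordered left to right. It assumes the extraction library hands back `(x, y, text)` tuples with `y` growing downward; that input shape, the tolerance value, and the function name `group_into_rows` are all assumptions for illustration, not the API of any particular library.

```python
def group_into_rows(words, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, top to bottom, left to right.

    Assumes y grows downward and that fragments on one table row have
    nearly equal y; both are assumptions about the extractor's output.
    """
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))  # same row as the previous fragment
        else:
            rows.append({"y": y, "cells": [(x, text)]})  # start a new row
    # Order the cells of each row by their x position and keep only the text.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


words = [
    (200, 100.5, "Qty"),
    (50, 100.0, "Item"),
    (50, 120.0, "Widget"),
    (200, 119.6, "2"),
]
print(group_into_rows(words))
```

Real tools layer more rules on top of this (column alignment, ruling lines, font changes), which is why results differ so much between them.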

<a href="http://lucene.apache.org/tika/" rel="nofollow noreferrer">Apache Tika</a> is an open-source Java toolkit that specializes in what you are looking for: extracting structured content from various documents, including PDF.

It uses PDFBox for the PDF file format, but provides a level of abstraction that is ideal for extracting structured content.

It includes a command-line utility; see <a href="http://lucene.apache.org/tika/gettingstarted.html" rel="nofollow noreferrer">here</a>.

Tabular data in PDF files is usually hard to extract properly because most PDF files out there do not contain Structured Content metadata. Without this metadata, a PDF file is just a pile of text and other drawing operations. Most of the time, only a human can tell whether there is a table in a document.
Almost all sufficiently advanced tools and libraries try to structure the text extracted from a PDF in some way using heuristics. Results, of course, vary from tool to tool and from library to library.
You can try the <a href="https://bitmiracle.com/pdf-library/" rel="nofollow noreferrer">Docotic.Pdf library</a> (disclaimer: I work for Bit Miracle) to extract text from PDF files. I think the library should extract text with quality sufficient for further processing.
Please take a look at a sample that shows <a href="https://bitmiracle.com/blog/extract-text-from-pdf-in-net" rel="nofollow noreferrer">how to extract text from PDF</a>.

Try the open-source Java PDF library iText:

<a href="http://www.lowagie.com/iText/docs.html" rel="nofollow noreferrer">http://www.lowagie.com/iText/docs.html</a>

I need to extract the text from a PDF file. This text will likely be in a table format, and it is going to be used for automatic transfer of data between an external party and our systems.

Can anyone suggest a command-line tool (e.g. pdf to txt) or a library that would be good for this?

Language options:

<ul>
<li>C# (preferred)</li>
<li>Java (if I must)</li>
</ul>

I found some ideas here, but I think the guy was talking more about a one-off situation; I'm talking about something more like a daily import:

<a href="https://stackoverflow.com/questions/488089/extracting-tables-from-pdf-files">https://stackoverflow.com/questions/488089/extracting-tables-from-pdf-files</a>

Extracting text from a PDF file
