<a href="http://tabula.nerdpower.org/">Tabula</a> is a pretty good start on a JRuby web interface for extracting CSV/TSV tables from arbitrary PDFs.

I have implemented my own algorithm, named <code>traprange</code>, to parse tabular data in PDF files.

Here are some sample PDF files and their results:

<ol>
<li>Input file: <a href="https://github.com/thoqbk/traprange/blob/master/_Docs/sample-1.pdf" rel="noreferrer">sample-1.pdf</a>, result: <a href="http://htmlpreview.github.io/?https://github.com/thoqbk/traprange/blob/master/_Docs/result/sample-1.html" rel="noreferrer">sample-1.html</a></li>
<li>Input file: <a href="https://github.com/thoqbk/traprange/blob/master/_Docs/sample-4.pdf" rel="noreferrer">sample-4.pdf</a>, result: <a href="http://htmlpreview.github.io/?https://github.com/thoqbk/traprange/blob/master/_Docs/result/sample-4.html" rel="noreferrer">sample-4.html</a></li>
</ol>

For details, visit my project page at <a href="https://github.com/thoqbk/traprange" rel="noreferrer">traprange</a> or read my article at <a href="http://www.dzone.com/articles/traprange-method-extract-table" rel="noreferrer">traprange</a>.
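As a rough illustration only (this is not the project's actual code), one common way to locate table columns from text positions is to merge the overlapping horizontal extents of all text chunks: each merged range is a candidate column, and the gaps between ranges are the column boundaries. The function name and the sample coordinates below are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[float, float]


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Union overlapping (start, end) ranges; each result is one candidate column."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of starting a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical horizontal extents (x_start, x_end) of text chunks from three table rows.
chunks = [(10, 40), (60, 95), (120, 150),
          (12, 35), (58, 90), (118, 155),
          (10, 45), (62, 99)]
columns = merge_ranges(chunks)
print(columns)
```

The same merge applied to vertical extents would give candidate rows.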

You can use Camelot to extract tables from your PDF and export them to an HTML file. CSV, Excel and JSON are also supported. The documentation is at <a href="http://camelot-py.readthedocs.io" rel="noreferrer">http://camelot-py.readthedocs.io</a>. It gives more accurate results than other open-source table extraction tools and libraries; here's a <a href="https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools" rel="noreferrer">comparison</a>.

The following snippet reads the tables from a PDF and writes the first one to an HTML file:

<pre><code>&gt;&gt;&gt; import camelot
&gt;&gt;&gt; tables = camelot.read_pdf('file.pdf')
&gt;&gt;&gt; type(tables[0].df)
&lt;class 'pandas.core.frame.DataFrame'&gt;
&gt;&gt;&gt; tables[0].to_html('file.html')
</code></pre>

Disclaimer: I'm the author of the library.

If you need to extract data from tables once a week and you are on Windows, take a look at this freeware PDF utility, which includes automated table detection and table-to-CSV/XML conversion: <a href="https://bytescout.com/products/enduser/misc/pdfviewer.html" rel="nofollow">PDF Viewer utility</a>.

The utility is free for both commercial and non-commercial use by non-developers (there is a separate version for developers who want to automate via an API).

Disclaimer: I work for ByteScout.

I have tried many OCR and text-conversion programs, though I believe one should write the PDF-to-text conversion program oneself, since the person performing the task understands the image best.

I also tried Google and many other online (about 900 websites) and offline (about 1000 programs) products from different companies. Whether you want to extract text via OCR or directly from the PDF, the most accurate program I found is <a href="http://sourceforge.net/projects/pdftohtml" rel="nofollow">PDFTOHTML</a>. The accuracy of <a href="http://sourceforge.net/projects/pdftohtml" rel="nofollow">PDFTOHTML</a> is about 98%, while Google's online tool reaches about 94%. It is very good software that also preserves the formatting of the text, such as bold and italic.

For major templates, Tabula is the best open-source option, while Abbyy PDF editor is a great solution for enterprise-level PDF data extraction and modification. Abbyy works on OCR.

Tabula offers two options: automatic table detection, and manual selection by providing coordinates.

Are the tables in the same place each time? If you can find the dimensions of each box, you could use a tool to split the PDF into multiple documents, each of which contains one box, and then use whatever tool you want to convert each smaller PDF to HTML (such as the tools mentioned in other answers). A quick Google search turned up <a href="http://pybrary.net/pyPdf/" rel="nofollow">PyPdf</a>, which looks like it might have some useful functions.

If you can't hard-code the size of the box (or want to apply the solution to multiple menus in different formats), the obvious method to me (I said obvious, not easy) would be edge detection to find where the border of the table is, followed by the splitting described above.
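A minimal sketch of the border-finding idea, assuming the page has already been rasterized and thresholded into a binary grid (1 = dark pixel). It uses simple run-length line detection as a stand-in for real edge detection: any row or column containing a long unbroken run of dark pixels is treated as a table rule. The names `longest_run`, `table_bounds` and `min_run` are made up for this illustration.

```python
from typing import List, Optional, Sequence, Tuple


def longest_run(cells: Sequence[int]) -> int:
    """Length of the longest unbroken run of dark (non-zero) pixels."""
    best = current = 0
    for cell in cells:
        current = current + 1 if cell else 0
        best = max(best, current)
    return best


def table_bounds(image: List[List[int]], min_run: int) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the outermost ruled lines, or None."""
    height, width = len(image), len(image[0])
    rows = [y for y in range(height) if longest_run(image[y]) >= min_run]
    cols = [x for x in range(width)
            if longest_run([image[y][x] for y in range(height)]) >= min_run]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1], rows[-1])


# Tiny synthetic "page": a rectangular border plus one speck of text inside it.
width, height = 10, 8
page = [[0] * width for _ in range(height)]
for x in range(2, 8):            # horizontal borders at y=1 and y=5
    page[1][x] = page[5][x] = 1
for y in range(1, 6):            # vertical borders at x=2 and x=7
    page[y][2] = page[y][7] = 1
page[3][4] = 1                   # isolated text pixel, too short to count as a line
box = table_bounds(page, min_run=4)
print(box)
```

The resulting box could then be used as the crop region for the splitting step; on real scans `min_run` would need tuning so that text strokes are not mistaken for borders.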

I recently ran into a similar problem.

An alternative solution I found was to open the PDF document in Adobe and export it to XML. At least with my PDFs, it preserved the table information, and I was then able to work with the XML programmatically to generate tabular files such as Excel.

The other issue I ran into was that Adobe only lets you export one file at a time, and I had lots of files. Luckily, Adobe also has a merge function. I ended up merging all the files together, exporting them as one big XML file, and working with that file to generate what I needed.
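A minimal sketch of that post-processing step, assuming the exported XML marks tables with <code>Table</code>, <code>TR</code>, <code>TH</code> and <code>TD</code> elements. Those tag names are an assumption made for this example; inspect your own export and adjust them to match.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tables_to_csv(xml_text: str) -> str:
    """Write every row of every table in the XML as one CSV line."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for table in root.iter("Table"):          # assumed tag name
        for row in table.iter("TR"):          # assumed tag name
            cells = ["".join(cell.itertext()).strip()
                     for cell in row if cell.tag in ("TH", "TD")]
            writer.writerow(cells)
    return out.getvalue()


# Hypothetical fragment standing in for a real export.
sample = """<Document><Table>
<TR><TH>Day</TH><TH>Dish</TH></TR>
<TR><TD>Mon</TD><TD>Pasta, salad</TD></TR>
</Table></Document>"""
csv_text = tables_to_csv(sample)
print(csv_text)
```

Because the merged export is a single XML file, the same loop handles all of the original documents in one pass.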

I have the same data saved as a GIF image file and as a PDF file, and I want to parse it to HTML or XML. The data is actually the menu for my university's cafeteria, which means there is a new version of the file to parse each week!
In general, the files contain some header and footer text, with a table full of other data in between.
I have read some posts on Stack Overflow and have made some attempts to parse the table data into HTML/XML:

PDF

<ul>
<li>PDFBox || iText (Java)</li>
<li>Google Docs Import</li>
<li>PDF2HTML || PDF2Table</li>
</ul>

GIF

<ul>
<li>Tesseract-OCR</li>
</ul>

I got the best result from parsing the PDF file with PDFBox, but because the menu changes weekly, it is still not reliable enough. The HTML I receive sometimes includes more and sometimes fewer "paragraphs" (<code>&lt;p&gt;</code>), so I am not able to parse the data precisely enough.

That is why I would like to know whether there is another way to do it.

PDF table extraction
