
You can take a look at <a href="http://toxy.codeplex.com">toxy.codeplex.com</a>. Toxy is a pure .NET text extraction framework. 

Toxy is very simple to use. For example, to extract an Excel spreadsheet file called test.xlsx:

<pre><code>ParserContext context = new ParserContext("test.xlsx");
ISpreadsheetParser parser = ParserFactory.CreateSpreadsheet(context);
ToxySpreadsheet ss = parser.Parse();
// then you can start handling the result, a ToxySpreadsheet object
</code></pre>


Here's a link on extracting text from Word documents:

<a href="https://stackoverflow.com/questions/1011234/how-to-extract-text-from-ms-office-documents-in-c-sharp">How to extract text from MS office documents in C#</a>

For PDF I would use PDFsharp. It is open source, and there are some good examples on its website:

<a href="http://pdfsharp.com/PDFsharp/" rel="nofollow noreferrer">http://pdfsharp.com/PDFsharp/</a>


For extracting text from PDF files, <a href="http://sourceforge.net/projects/itextsharp/" rel="nofollow">iTextSharp</a> is awesome. It is free and open source.

Reading text from a PDF is very easy with this library.
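As a rough illustration of the point above, here is a minimal sketch that assumes the iTextSharp 5.x package is referenced. The class name `PdfTextDemo` and the method name `ExtractText` are made up for this example; `PdfReader` and `PdfTextExtractor.GetTextFromPage` come from the library. It has not been checked against other iTextSharp versions, and the license terms differ between versions, so check the one you install.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextDemo
{
    // Hypothetical helper: concatenates the text of every page
    // in the PDF file at the given path.
    public static string ExtractText(string path)
    {
        var text = new StringBuilder();
        var reader = new PdfReader(path);
        try
        {
            // iTextSharp page numbers start at 1, not 0.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            // Release the underlying file handle.
            reader.Close();
        }
        return text.ToString();
    }
}
```

Calling `PdfTextDemo.ExtractText("test.pdf")` would return the text of all pages as one string. Scanned PDFs that contain only images have no text layer, so they would still need OCR.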


I would recommend Aspose Total for this. A few years ago I did a project doing pretty much exactly what you are asking, and compared to using Office Interop across different versions of Office (prior to the change to XML), Aspose was the most robust library. You will probably have to do some OCR too, based on what you are describing. It's not cheap, but I found their APIs pretty solid, and it works on most versions of the file types you are asking about. You should be able to use the free trial to see if it fits your project. I have no affiliation with Aspose other than having used their tools in a production environment.

<a href="http://www.aspose.com/categories/.net-components/aspose.total-for-.net/default.aspx" rel="nofollow">Aspose Total</a>


If you just need text, you can use iFilter. It is not a single product, but it is free. iFilter is used to extract text to support Microsoft Indexing Service. Search for "iFilter .NET C#" for examples of how to use it. If you need formatted text, it is not the right tool: it extracts raw text only, with a lot of line breaks.

I need a .NET library that I can use to extract text data from PDF, Excel and Word files.

Ideally, a free tool!

Would you recommend any?

Many thanks,

How to extract text from PDF, Word and Excel documents?
