How do I read PDF content with iTextSharp using the PdfReader class? My PDF may contain either plain text or images of text.
Posted on 2011-02-15 19:47:09
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // Pages are numbered from 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy for each page so text does not accumulate across pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                // GetTextFromPage already returns a .NET string. Round-tripping it through
                // Encoding.Default can corrupt non-ASCII characters, so no re-encoding is done.
                text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            pdfReader.Close();
        }
    }
    return text.ToString();
}
Posted on 2014-11-05 00:16:35
LGPL / FOSS iTextSharp 4.x
var pdfReader = new PdfReader(path); // or a FileStream, etc.
byte[] pageContent = pdfReader.GetPageContent(pageNum); // pageNum is 1-based, not zero-based
byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
string textFromPage = Encoding.UTF8.GetString(utf8);
None of the other answers worked for me; they all seem to target the AGPL v5 release of iTextSharp. I could not find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.
Something else that may be very useful in connection with this:
// Requires: using System.Collections.Generic; using System.Linq;
//           using System.Text.RegularExpressions;

// Matches a PDF show-text operator of the form "(...)Tj".
const string PdfTableFormat = @"\(.*\)Tj";
Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);

List<string> ExtractPdfContent(string rawPdfContent)
{
    var matches = PdfTableRegex.Matches(rawPdfContent);
    var list = matches.Cast<Match>()
        .Select(m => m.Value
            .Substring(1)                 // remove the leading (
            .Remove(m.Value.Length - 4)   // remove the trailing )Tj
            .Replace(@"\)", ")")          // unescape parentheses
            .Replace(@"\(", "(")
            .Trim()
        )
        .ToList();
    return list;
}
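This extracts the plain-text data from the PDF: if the displayed text is Foo(bar), it is encoded in the PDF as (Foo\(bar\))Tj, and this method returns Foo(bar) as expected. It strips a lot of additional information, such as position coordinates, from the raw PDF content.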
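As a rough illustration of the escaping described above, here is a minimal standalone sketch that does not depend on iTextSharp. It is a variation, not the code from this answer: the class name TjSketch, the method Extract, and the sample content string are all hypothetical, and the pattern is a stricter one that stops at the first unescaped closing parenthesis so that two Tj operators on the same line are returned separately. It assumes only `\(`, `\)` and `\\` escapes and does not handle unescaped nested parentheses, octal escapes, or the TJ array operator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TjSketch
{
    // Matches "(...)Tj". The string body is a run of either an escaped
    // character (backslash + any char) or any char that is not a
    // backslash or a parenthesis.
    static readonly Regex ShowText = new Regex(
        @"\((?<text>(?:\\.|[^\\()])*)\)\s*Tj", RegexOptions.Compiled);

    public static List<string> Extract(string rawContent)
    {
        return ShowText.Matches(rawContent)
            .Cast<Match>()
            // Unescape \( \) and \\ only (assumption: no other escapes appear).
            .Select(m => Regex.Replace(m.Groups["text"].Value, @"\\([()\\])", "$1"))
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical content-stream fragment, not taken from a real file.
        string sample = "BT /F1 12 Tf 72 720 Td (Foo\\(bar\\))Tj 0 -14 Td (Baz) Tj ET";
        foreach (string s in Extract(sample))
            Console.WriteLine(s);
    }
}
```

With the sample string above, the two extracted items are Foo(bar) and Baz. On real files the page content may also be compressed or use font-specific encodings, in which case a regex over the raw bytes will not be enough.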
Posted on 2011-09-02 02:08:45
Here is a VB.NET solution based on ShravankumarKumar's solution.
It will only give you the text. Images are a different story.
Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
    Dim sOut = ""
    ' Pages are numbered from 1.
    For i = 1 To oReader.NumberOfPages
        Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
        sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
    Next
    ' Release the file handle once all pages have been read.
    oReader.Close()
    Return sOut
End Function
https://stackoverflow.com/questions/2550796