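I have a PDF file and I am reading its text with ITextExtractionStrategy. Now I take a substring from that text, such as My name is XYZ, and need to get the rectangular coordinates of that substring in the PDF file, but I have not been able to do it. While searching I came across LocationTextExtractionStrategy, but I could not work out how to use it to get the coordinates.
Here is the code.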
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
string getcoordinate="My name is XYZ";
How can I get the rectangular coordinates of this substring using iTextSharp?
Please help.
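Answered 2014-05-28 15:11:13
Below is a very, very simple version of an implementation.
Before implementing it, it is very important to know that PDFs have no concept of "words", "paragraphs", "sentences", etc. Also, text within a PDF is not necessarily laid out left to right and top to bottom, and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into a PDF like this: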
Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)
It could also be written as
Draw Hello World at (10,10)
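The ITextExtractionStrategy interface that you need to implement has a method called RenderText that gets called once for each chunk of text within the PDF. Notice I said "chunk" and not "word". In the first example above the method would be called four times for those two words. In the second example it would be called once for those two words. This is the very important part to understand. PDFs don't have words and, because of this, iTextSharp doesn't have words either. The "word" part is 100% up to you to solve.
Also, as I said above, PDFs don't have paragraphs. The reason to be aware of this is that PDFs cannot wrap text to a new line. Any time you see something that looks like a paragraph return, you are actually seeing a brand new text drawing command that has a different y coordinate from the previous line. See the linked discussion for more detail.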
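To make the "chunk, not word" point concrete, here is a minimal sketch that contains no iTextSharp at all. DrawnChunk and ChunkOrderDemo are hypothetical names made up for this illustration, and I am assuming all four chunks of the first example sit on the same baseline (y = 10). No single chunk contains "World", so a per-chunk search would miss it, but ordering the chunks by position and joining them brings the phrase back.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one text-drawing command; not an iTextSharp type.
public sealed class DrawnChunk
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }

    public DrawnChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ChunkOrderDemo
{
    // Groups chunks that share exactly the same y coordinate, orders the lines
    // top to bottom (larger y first, because PDF y grows upward) and the chunks
    // inside each line left to right, then joins each line into one string.
    public static List<string> RebuildLines(IEnumerable<DrawnChunk> chunks)
    {
        return chunks
            .GroupBy(c => c.Y)
            .OrderByDescending(g => g.Key)
            .Select(g => string.Concat(g.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}
```

Real documents are messier than this: baselines that differ by a fraction of a unit, rotated text and missing space characters all need tolerance logic that this sketch deliberately leaves out.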
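The code below is a very simple implementation. For it I am subclassing LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code here) and store it for later. I am using this simple helper class to store these chunks and rectangles: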
//Helper class that stores our rectangle and text
public class RectAndText {
public iTextSharp.text.Rectangle Rect;
public String Text;
public RectAndText(iTextSharp.text.Rectangle rect, String text) {
this.Rect = rect;
this.Text = text;
}
}
And here is the subclass:
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
//Hold each coordinate
public List<RectAndText> myPoints = new List<RectAndText>();
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo) {
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
}
}
And finally, code that exercises the above:
//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, fs)) {
doc.Open();
doc.Add(new Paragraph("This is my sample file"));
doc.Close();
}
}
}
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}
//Loop through each chunk found
foreach (var p in t.myPoints) {
Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}
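I must stress that the above does not take "words" into account; that part is up to you. The TextRenderInfo object that gets passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.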
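EDIT
(I had a great lunch, so I'm feeling a little more helpful.)
Here is an updated version of MyLocationTextExtractionStrategy that does what my comments below describe: it takes a string to search for and searches each chunk for that string. For all of the reasons listed above this will not work in some/many/most/all cases. If the substring occurs more than once in a single chunk, it will also return only the first instance. Ligatures and diacritics could trip this up too.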
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
//Hold each coordinate
public List<RectAndText> myPoints = new List<RectAndText>();
//The string that we're searching for
public String TextToSearchFor { get; set; }
//How to compare strings
public System.Globalization.CompareOptions CompareOptions { get; set; }
public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
this.TextToSearchFor = textToSearchFor;
this.CompareOptions = compareOptions;
}
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo) {
base.RenderText(renderInfo);
//See if the current chunk contains the text
var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);
//If not found bail
if (startPosition < 0) {
return;
}
//Grab the individual characters
var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();
//Grab the first and last character
var firstChar = chars.First();
var lastChar = chars.Last();
//Get the bounding box for the chunk of text
var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
var topRight = lastChar.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
}
}
You would use this the same way as before, but now the constructor has a single required parameter:
var t = new MyLocationTextExtractionStrategy("sample");
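The "only the first instance" limitation mentioned above comes from calling IndexOf once per chunk. A rough sketch of one way around it is to keep calling IndexOf from just past the previous hit. SubstringSearch and FindAll are hypothetical names that are not part of iTextSharp, and I am using the invariant culture here rather than the current culture used in the answer, purely so the results do not depend on the machine's settings.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SubstringSearch
{
    // Returns the start index of every non-overlapping match of needle in haystack.
    public static List<int> FindAll(string haystack, string needle,
        CompareOptions options = CompareOptions.None)
    {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(haystack) || string.IsNullOrEmpty(needle))
        {
            return hits;
        }

        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        int from = 0;

        // Stop once there is no room left for another full match.
        while (from <= haystack.Length - needle.Length)
        {
            int at = compareInfo.IndexOf(haystack, needle, from, options);
            if (at < 0)
            {
                break;
            }
            hits.Add(at);
            // Continue after this match so matches never overlap.
            from = at + needle.Length;
        }
        return hits;
    }
}
```

Each returned index could then be fed to the same Skip(...).Take(...) over GetCharacterRenderInfos() that the class above uses, adding one rectangle per hit. This still assumes one character render info per character of text, so the ligature and diacritic caveats apply unchanged.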
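Answered 2015-10-08 11:26:22
This is an old question, but I am leaving my answer here because I could not find a correct one on the web.
As Chris Haas explained, dealing with words is not easy because iText deals with chunks. The code Chris posted failed in most of my tests because a word is normally split across different chunks (he warns about that in his post).
To solve that problem, the strategy I used, as the code below shows, is: keep the per-character TextRenderInfo objects of every chunk, group the chunks by line, and then search each line for the text.
I am leaving the code here. I tested it with a couple of documents and it works pretty well, but it could fail in some scenarios because this chunk -> word transformation is a bit tricky.
Hope this helps someone.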
class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
private String m_SearchText;
public const float PDF_PX_TO_MM = 0.3528f;
public float m_PageSizeY;
public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
: base()
{
this.m_SearchText = sSearchText;
this.m_PageSizeY = fPageSizeY;
}
private void searchText()
{
foreach (LineInfo aLineInfo in m_LinesTextInfo)
{
int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
if (iIndex != -1)
{
TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
this.m_SearchResultsList.Add(aSearchResult);
}
}
}
private void groupChunksbyLine()
{
LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
LocationTextExtractionStrategyEx.LineInfo textInfo = null;
foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
{
if (textChunk1 == null)
{
textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
this.m_LinesTextInfo.Add(textInfo);
}
else if (textChunk2.sameLine(textChunk1))
{
textInfo.appendText(textChunk2);
}
else
{
textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
this.m_LinesTextInfo.Add(textInfo);
}
textChunk1 = textChunk2;
}
}
public override string GetResultantText()
{
groupChunksbyLine();
searchText();
//In this case the return value is not useful
return "";
}
public override void RenderText(TextRenderInfo renderInfo)
{
LineSegment baseline = renderInfo.GetBaseline();
//Create ExtendedChunk
ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
this.m_DocChunks.Add(aExtendedChunk);
}
public class ExtendedTextChunk
{
public string m_text;
private Vector m_startLocation;
private Vector m_endLocation;
private Vector m_orientationVector;
private int m_orientationMagnitude;
private int m_distPerpendicular;
private float m_charSpaceWidth;
public List<TextRenderInfo> m_ChunkChars;
public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars)
{
this.m_text = txt;
this.m_startLocation = startLoc;
this.m_endLocation = endLoc;
this.m_charSpaceWidth = charSpaceWidth;
this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];
this.m_ChunkChars = chunkChars;
}
public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
{
return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
}
}
public class SearchResult
{
public int iPosX;
public int iPosY;
public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
{
//Get position of upperLeft coordinate
Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
//PosX
float fPosX = vTopLeft[Vector.I1];
//PosY
float fPosY = vTopLeft[Vector.I2];
//Transform to mm and get y from top of page
iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
}
}
public class LineInfo
{
public string m_Text;
public List<TextRenderInfo> m_LineCharsList;
public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
{
this.m_Text = initialTextChunk.m_text;
this.m_LineCharsList = initialTextChunk.m_ChunkChars;
}
public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
{
m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
this.m_Text += additionalTextChunk.m_text;
}
}
}
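The SearchResult class above does two things at once: it converts PDF units to millimetres and it flips y so that it is measured from the top of the page. Pulled out on its own, the arithmetic looks roughly like this. PdfUnits and its members are hypothetical names. The constant 0.3528 in the answer is 25.4 / 72, since PDF user space units are points (1/72 inch) by default and not pixels.

```csharp
using System;

public static class PdfUnits
{
    // 1 pt = 1/72 inch and 1 inch = 25.4 mm, so 1 pt is roughly 0.3528 mm.
    public const float PointsToMm = 25.4f / 72f;

    // Distance from the left page edge, in whole millimetres.
    public static int ToMmFromLeft(float xPoints)
    {
        return Convert.ToInt32(xPoints * PointsToMm);
    }

    // PDF y is measured from the bottom of the page, so subtract it from the
    // page height to get the distance from the top, then convert to whole mm.
    public static int ToMmFromTop(float yPoints, float pageHeightPoints)
    {
        return Convert.ToInt32((pageHeightPoints - yPoints) * PointsToMm);
    }
}
```

The page height has to be supplied by the caller, which is what the fPageSizeY constructor argument is for. I would expect something like reader.GetPageSize(pageNumber).Height to provide it, but treat that as an assumption to check against your iTextSharp version. The sketch also assumes the page's lower-left corner is at the origin and the page is not rotated.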
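Answered 2017-06-09 06:11:44
I know this is a really old question, but below is what I ended up doing. I am posting it here in the hope that it will be useful to someone else.
The following code gives you the starting coordinates of the line(s) that contain the search text. It should not be hard to modify it to give the positions of individual words. Note: I tested this on itextsharp 5.5.11.0, and it will not work on some older versions.
As mentioned above, PDFs have no concept of words, lines or paragraphs. But I found that LocationTextExtractionStrategy does a very good job of splitting lines and words, so my solution is based on that.
DISCLAIMER:
This solution is based on https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs, and that file has a comment saying that it is a dev preview. So this might not work in the future.
Anyway, here is the code.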
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;
namespace Logic
{
public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
{
private readonly List<TextChunk> locationalResult = new List<TextChunk>();
private readonly ITextChunkLocationStrategy tclStrat;
public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp()) {
}
/**
* Creates a new text extraction renderer, with a custom strategy for
* creating new TextChunkLocation objects based on the input of the
* TextRenderInfo.
* @param strat the custom strategy
*/
public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
{
tclStrat = strat;
}
private bool StartsWithSpace(string str)
{
if (str.Length == 0) return false;
return str[0] == ' ';
}
private bool EndsWithSpace(string str)
{
if (str.Length == 0) return false;
return str[str.Length - 1] == ' ';
}
/**
* Filters the provided list with the provided filter
* @param textChunks a list of all TextChunks that this strategy found during processing
* @param filter the filter to apply. If null, filtering will be skipped.
* @return the filtered list
* @since 5.3.3
*/
private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
{
if (filter == null)
{
return textChunks;
}
var filtered = new List<TextChunk>();
foreach (var textChunk in textChunks)
{
if (filter.Accept(textChunk))
{
filtered.Add(textChunk);
}
}
return filtered;
}
public override void RenderText(TextRenderInfo renderInfo)
{
LineSegment segment = renderInfo.GetBaseline();
if (renderInfo.GetRise() != 0)
{ // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to
Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
segment = segment.TransformBy(riseOffsetTransform);
}
TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
locationalResult.Add(tc);
}
public IList<TextLocation> GetLocations()
{
var filteredTextChunks = filterTextChunks(locationalResult, null);
filteredTextChunks.Sort();
TextChunk lastChunk = null;
var textLocations = new List<TextLocation>();
foreach (var chunk in filteredTextChunks)
{
if (lastChunk == null)
{
//initial
textLocations.Add(new TextLocation
{
Text = chunk.Text,
X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
});
}
else
{
if (chunk.SameLine(lastChunk))
{
var text = "";
// we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
text += ' ';
text += chunk.Text;
textLocations[textLocations.Count - 1].Text += text;
}
else
{
textLocations.Add(new TextLocation
{
Text = chunk.Text,
X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
});
}
}
lastChunk = chunk;
}
//now find the location(s) with the given texts
return textLocations;
}
}
public class TextLocation
{
public float X { get; set; }
public float Y { get; set; }
public string Text { get; set; }
}
}
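How to call the method:
//Requires using System.Linq; res is declared outside the using block so it is still in scope afterwards
IList<TextLocation> res;
using (var reader = new PdfReader(inputPdf))
{
var parser = new PdfReaderContentParser(reader);
var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());
res = strategy.GetLocations();
//No explicit reader.Close() is needed, the using block disposes the reader
}
var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => p.Y).Reverse().ToList();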
inputPdf is a byte[] that has the pdf data
pageNumber is the page where you want to search in
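As a side note on the ordering in that last line: PDF y coordinates grow upward, so sorting by Y from largest to smallest lists the matching lines from the top of the page to the bottom. Here is a small self-contained sketch of the same filter-and-sort step. LineLocation is a hypothetical stand-in that mirrors TextLocation, and LineSearch is a made-up helper name, so the snippet compiles without iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in that mirrors the TextLocation class from the answer.
public sealed class LineLocation
{
    public float X { get; set; }
    public float Y { get; set; }
    public string Text { get; set; }
}

public static class LineSearch
{
    // Keeps the lines whose text contains searchText (ordinal, case-sensitive)
    // and orders them top to bottom, i.e. by descending y.
    public static List<LineLocation> FindTopToBottom(
        IEnumerable<LineLocation> lines, string searchText)
    {
        return lines
            .Where(l => l.Text != null
                && l.Text.IndexOf(searchText, StringComparison.Ordinal) >= 0)
            .OrderByDescending(l => l.Y)
            .ToList();
    }
}
```

OrderByDescending is used instead of OrderBy followed by Reverse. The two agree whenever the Y values are distinct, but OrderByDescending keeps lines with equal Y in their original relative order, while Reverse flips them.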
https://stackoverflow.com/questions/23909893