文章/答案/技术大牛

发布

社区首页 >问答首页 >用ITextExtractionStrategy和LocationTextExtractionStrategy获取字符串的坐标

问用ITextExtractionStrategy和LocationTextExtractionStrategy获取字符串的坐标
EN

Stack Overflow用户

提问于 2014-05-28 11:05:15

回答 4查看 49.8K关注 0票数 21

我有一个PDF文件，我使用ITextExtractionStrategy.Now从字符串中读取一个子字符串，像My name is XYZ，需要从PDF文件中获取子字符串的直角坐标，但是无法进行it.On搜索，我了解了LocationTextExtractionStrategy，但没有得到如何使用它来获取坐标。

这是密码。

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);

string getcoordinate="My name is XYZ";

如何使用ITEXTSHARP获得这个子字符串的直角坐标。

请帮帮忙。

itextsharp

回答 4

Stack Overflow用户

回答已采纳

发布于 2014-05-28 15:11:13

下面是一个非常非常简单的实现版本。

在实现之前，非常重要的一点是要知道PDF没有“单词”、“段落”、“句子”等的概念。此外，PDF中的文本不一定是从左到右和从上到下排列的，这与非LTR语言无关。"Hello“这一短语可以写进PDF格式如下：

Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)

它也可以写成

Draw Hello World at (10,10)

您需要实现的ITextExtractionStrategy接口有一个名为RenderText的方法，该方法对PDF中的每个文本块都调用一次。注意我说的是“块”而不是“单词”。在上面的第一个例子中，对于这两个单词，该方法将被调用四次。在第二个例子中，这两个单词只需要调用一次。这是需要理解的非常重要的部分。PDF没有单词，正因为如此，iTextSharp也没有单词。“单词”部分是100%由你来解决的。

同样，就像我上面说的，PDF没有段落。之所以要注意这一点，是因为PDF不能将文本包装到新行。每当您看到类似段落返回的内容时，实际上都会看到一个全新的文本绘图命令，该命令具有与前一行不同的y坐标。见this for further discussion。

下面的代码是一个非常简单的实现。为此，我正在子类LocationTextExtractionStrategy，它已经实现了ITextExtractionStrategy。在每次调用RenderText()时，我都会找到当前块的矩形(使用Mark's code here)，并将其存储以供以后使用。我使用这个简单的助手类来存储这些块和矩形：

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

下面是子类：

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

最后，执行上述内容：

//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();

            doc.Add(new Paragraph("This is my sample file"));

            doc.Close();
        }
    }
}

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

我要强调的是，上面的没有考虑到“单词”，这取决于您。传递给RenderText的RenderText对象有一个名为GetCharacterRenderInfos()的方法，您可以使用该方法获取更多信息。如果您不关心字体中的下降器，也可能需要使用GetBaseline() instead ofGetDescentLine()`。

编辑

(我午餐吃得很好，所以我觉得更有帮助。)

下面是一个更新的MyLocationTextExtractionStrategy版本，它完成了我下面的评论，即它需要一个字符串来搜索和搜索每个块中的字符串。由于所列的所有原因，这在某些/许多/大多数/所有情况下都行不通。如果子字符串在单个块中存在多次，那么它也只能返回第一个实例。结扎和解说词也会把这件事搞砸。

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //The string that we're searching for
    public String TextToSearchFor { get; set; }

    //How to compare strings
    public System.Globalization.CompareOptions CompareOptions { get; set; }

    public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
        this.TextToSearchFor = textToSearchFor;
        this.CompareOptions = compareOptions;
    }

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //See if the current chunk contains the text
        var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

        //If not found bail
        if (startPosition < 0) {
            return;
        }

        //Grab the individual characters
        var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();

        //Grab the first and last character
        var firstChar = chars.First();
        var lastChar = chars.Last();


        //Get the bounding box for the chunk of text
        var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
        var topRight = lastChar.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }

您可以像以前一样使用这个参数，但是现在构造函数只有一个必需的参数：

var t = new MyLocationTextExtractionStrategy("sample");

票数 48

Stack Overflow用户

发布于 2015-10-08 11:26:22

这是个老问题，但我把我的回答留在这里，因为我在网上找不到正确的答案。

正如克里斯·哈斯( Chris )所揭示的，像iText处理块一样，处理单词并不容易。Chris post在我的大部分测试中失败的代码，因为一个单词通常被分割成不同的块(他在帖子中警告说)。

为了在这里解决这个问题，我采用的策略是：

字符中的分块(实际上，每个字符都有textrenderinfo对象)
按行分组字符。这是不直接的，因为你必须处理块对齐。
搜索每一行你需要找到的单词

我把密码留在这里。我用几个文档测试它，它运行得很好，但在某些情况下可能会失败，因为这个块->单词转换有点棘手。

希望这对某人有帮助。

  class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
    private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
    private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
    public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
    private String m_SearchText;
    public const float PDF_PX_TO_MM = 0.3528f;
    public float m_PageSizeY;


    public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
        : base()
    {
        this.m_SearchText = sSearchText;
        this.m_PageSizeY = fPageSizeY;
    }

    private void searchText()
    {
        foreach (LineInfo aLineInfo in m_LinesTextInfo)
        {
            int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
            if (iIndex != -1)
            {
                TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
                SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
                this.m_SearchResultsList.Add(aSearchResult);
            }
        }
    }

    private void groupChunksbyLine()
    {                     
        LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
        LocationTextExtractionStrategyEx.LineInfo textInfo = null;
        foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
        {
            if (textChunk1 == null)
            {                    
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            else if (textChunk2.sameLine(textChunk1))
            {                      
                textInfo.appendText(textChunk2);
            }
            else
            {                                        
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            textChunk1 = textChunk2;
        }
    }

    public override string GetResultantText()
    {
        groupChunksbyLine();
        searchText();
        //In this case the return value is not useful
        return "";
    }

    public override void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        //Create ExtendedChunk
        ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
        this.m_DocChunks.Add(aExtendedChunk);
    }

    public class ExtendedTextChunk
    {
        public string m_text;
        private Vector m_startLocation;
        private Vector m_endLocation;
        private Vector m_orientationVector;
        private int m_orientationMagnitude;
        private int m_distPerpendicular;           
        private float m_charSpaceWidth;           
        public List<TextRenderInfo> m_ChunkChars;


        public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars)
        {
            this.m_text = txt;
            this.m_startLocation = startLoc;
            this.m_endLocation = endLoc;
            this.m_charSpaceWidth = charSpaceWidth;                
            this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
            this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
            this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];                
            this.m_ChunkChars = chunkChars;

        }


        public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
        {
            return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
        }


    }

    public class SearchResult
    {
        public int iPosX;
        public int iPosY;

        public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
        {
            //Get position of upperLeft coordinate
            Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
            //PosX
            float fPosX = vTopLeft[Vector.I1]; 
            //PosY
            float fPosY = vTopLeft[Vector.I2];
            //Transform to mm and get y from top of page
            iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
            iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
        }
    }

    public class LineInfo
    {            
        public string m_Text;
        public List<TextRenderInfo> m_LineCharsList;

        public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
        {                
            this.m_Text = initialTextChunk.m_text;
            this.m_LineCharsList = initialTextChunk.m_ChunkChars;
        }

        public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
        {
            m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
            this.m_Text += additionalTextChunk.m_text;
        }
    }
}

票数 9

Stack Overflow用户

发布于 2017-06-09 06:11:44

我知道这是一个很老的问题，但下面是我最后要做的事情。只是张贴在这里，希望它将是有用的其他人。

下面的代码将告诉您包含搜索文本的行的起始坐标。它应该不难修改，以提供立场的文字。请注意。我在itextsharp 5.5.11.0上测试了这一点，并且在一些旧版本上不能工作。

如上所述，pdfs没有单词/行或段落的概念。但我发现LocationTextExtractionStrategy在分割行和字方面做得很好。所以我的解决方案就是基于这个。

免责声明：

该解决方案基于https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs，该文件有一个注释，表示它是一个开发预览。因此，这在未来可能行不通。

不管怎么说这是密码。

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

namespace Logic
{
    public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
    {
        private readonly List<TextChunk> locationalResult = new List<TextChunk>();

        private readonly ITextChunkLocationStrategy tclStrat;

        public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp()) {
        }

        /**
         * Creates a new text extraction renderer, with a custom strategy for
         * creating new TextChunkLocation objects based on the input of the
         * TextRenderInfo.
         * @param strat the custom strategy
         */
        public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
        {
            tclStrat = strat;
        }


        private bool StartsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[0] == ' ';
        }


        private bool EndsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[str.Length - 1] == ' ';
        }

        /**
         * Filters the provided list with the provided filter
         * @param textChunks a list of all TextChunks that this strategy found during processing
         * @param filter the filter to apply.  If null, filtering will be skipped.
         * @return the filtered list
         * @since 5.3.3
         */

        private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
        {
            if (filter == null)
            {
                return textChunks;
            }

            var filtered = new List<TextChunk>();

            foreach (var textChunk in textChunks)
            {
                if (filter.Accept(textChunk))
                {
                    filtered.Add(textChunk);
                }
            }

            return filtered;
        }

        public override void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment segment = renderInfo.GetBaseline();
            if (renderInfo.GetRise() != 0)
            { // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to 
                Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
                segment = segment.TransformBy(riseOffsetTransform);
            }
            TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
            locationalResult.Add(tc);
        }


        public IList<TextLocation> GetLocations()
        {

            var filteredTextChunks = filterTextChunks(locationalResult, null);
            filteredTextChunks.Sort();

            TextChunk lastChunk = null;

             var textLocations = new List<TextLocation>();

            foreach (var chunk in filteredTextChunks)
            {

                if (lastChunk == null)
                {
                    //initial
                    textLocations.Add(new TextLocation
                    {
                        Text = chunk.Text,
                        X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                        Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                    });

                }
                else
                {
                    if (chunk.SameLine(lastChunk))
                    {
                        var text = "";
                        // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                        if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
                            text += ' ';

                        text += chunk.Text;

                        textLocations[textLocations.Count - 1].Text += text;

                    }
                    else
                    {

                        textLocations.Add(new TextLocation
                        {
                            Text = chunk.Text,
                            X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                            Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                        });
                    }
                }
                lastChunk = chunk;
            }

            //now find the location(s) with the given texts
            return textLocations;

        }

    }

    public class TextLocation
    {
        public float X { get; set; }
        public float Y { get; set; }

        public string Text { get; set; }
    }
}

如何调用该方法：

        using (var reader = new PdfReader(inputPdf))
            {

                var parser = new PdfReaderContentParser(reader);

                var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());

                var res = strategy.GetLocations();

                reader.Close();
             }
                var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => p.Y).Reverse().ToList();




inputPdf is a byte[] that has the pdf data

pageNumber is the page where you want to search in

票数 5

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/23909893

复制

相似问题

问用ITextExtractionStrategy和LocationTextExtractionStrategy获取字符串的坐标
EN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用ITextExtractionStrategy和LocationTextExtractionStrategy获取字符串的坐标EN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用ITextExtractionStrategy和LocationTextExtractionStrategy获取字符串的坐标
EN