Q: How to get the text position from a PDF page in iText 7

Stack Overflow user

Asked 2017-05-03 04:41:46

3 answers, 15.1K views, 0 followers, 4 votes

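I am trying to find the position of a piece of text on a PDF page.

What I have tried so far is to get the text of the PDF page through PdfTextExtractor with SimpleTextExtractionStrategy. I then loop over each word to check whether my word exists, splitting the extracted text into words like this: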

var Words = pdftextextractor.Split(new char[] { ' ', '\n' });

What I cannot work out is how to find the position of the text. All I need is the y coordinate of the word in the PDF file.

itext7

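3 Answers

Stack Overflow user

Accepted answer

Posted 2017-05-03 15:12:55

First, SimpleTextExtractionStrategy is not exactly the "smartest" strategy (as the name suggests).

Second, if you want the position, you are going to have to do a lot more work. A TextExtractionStrategy assumes you are only interested in the text.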

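Possible implementation:

- implement IEventListener
- get notified of all events that render text, and store the corresponding TextRenderInfo object
- once you have finished processing the document, sort these objects by their position on the page
- loop over this list of TextRenderInfo objects; they offer both the text being rendered and the coordinates

How to:

1. implement ITextExtractionStrategy (or extend an existing strategy)
2. run the extraction, for example with PdfTextExtractor.getTextFromPage(page, strategy), where strategy denotes the strategy you created in step 1
3. your strategy should be set up to keep track of the locations of the text it processes

ITextExtractionStrategy has the following method in its interface: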

@Override
public void eventOccurred(IEventData data, EventType type) {

    // you can first check the type of the event
    if (!type.equals(EventType.RENDER_TEXT))
        return;

    // now it is safe to cast
    TextRenderInfo renderInfo = (TextRenderInfo) data;
}

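An important thing to keep in mind is that the rendering instructions in a PDF do not need to appear in order. The text "Lorem Ipsum Dolor Sit Amet" could be rendered with instructions similar to:

render "Ipsum Do"

render "Lorem "

render "lor Sit Amet"

You will have to do some clever merging (depending on how far apart two TextRenderInfo objects are) and sorting (to get all the TextRenderInfo objects into the proper reading order).

Once that is done, the rest should be easy.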
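The sorting and merging step can be sketched without iText at all. The following is only an illustration under assumptions: RenderedChunk and ChunkMerger are hypothetical names, the fields stand in for values you would copy out of each TextRenderInfo, and chunks on the same line are assumed to report exactly the same baseline y.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for the values copied out of each TextRenderInfo.
class RenderedChunk {
    final String text;
    final float x;      // x of the baseline start point
    final float y;      // y of the baseline
    final float width;  // horizontal advance of the chunk

    RenderedChunk(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

class ChunkMerger {
    // Sorts chunks into reading order and joins them into one string.
    // Assumption: chunks on one line share the same baseline y.
    static String merge(List<RenderedChunk> chunks, float gapTolerance) {
        List<RenderedChunk> sorted = new ArrayList<>(chunks);
        // PDF y grows upwards, so a larger y comes first; then left to right.
        sorted.sort(Comparator
                .comparingDouble((RenderedChunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));

        StringBuilder out = new StringBuilder();
        RenderedChunk last = null;
        for (RenderedChunk c : sorted) {
            if (last != null) {
                if (last.y != c.y) {
                    out.append('\n');   // different baseline: new line
                } else if (c.x - (last.x + last.width) > gapTolerance) {
                    out.append(' ');    // visible gap: treat as word break
                }
            }
            out.append(c.text);
            last = c;
        }
        return out.toString();
    }
}
```

Fed the three out-of-order chunks from the example above, with positions that touch each other, merge returns the text in reading order. A real strategy would compare baselines with a tolerance rather than with exact equality.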

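Votes: 3

Stack Overflow user

Posted 2017-05-04 22:14:57

I was able to adapt this from my earlier iText 5 version. I don't know whether you are looking for C#, but that is what the code below is written in.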

using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class TextLocationStrategy : LocationTextExtractionStrategy
{
    private List<textChunk> objectResult = new List<textChunk>();

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals(EventType.RENDER_TEXT))
            return;

        TextRenderInfo renderInfo = (TextRenderInfo)data;

        string curFont = renderInfo.GetFont().GetFontProgram().ToString();

        float curFontSize = renderInfo.GetFontSize();

        IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
        foreach (TextRenderInfo t in text)
        {
            string letter = t.GetText();
            Vector letterStart = t.GetBaseline().GetStartPoint();
            Vector letterEnd = t.GetAscentLine().GetEndPoint();
            Rectangle letterRect = new Rectangle(letterStart.Get(0), letterStart.Get(1), letterEnd.Get(0) - letterStart.Get(0), letterEnd.Get(1) - letterStart.Get(1));

            if (letter != " " && !letter.Contains(' '))
            {
                textChunk chunk = new textChunk();
                chunk.text = letter;
                chunk.rect = letterRect;
                chunk.fontFamily = curFont;
                chunk.fontSize = curFontSize;
                chunk.spaceWidth = t.GetSingleSpaceWidth() / 2f;

                objectResult.Add(chunk);
            }
        }
    }
}
public class textChunk
{
    public string text { get; set; }
    public Rectangle rect { get; set; }
    public string fontFamily { get; set; }
    public float fontSize { get; set; }  // float, to match TextRenderInfo.GetFontSize()
    public float spaceWidth { get; set; }
}

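I also go down to each individual character because that works better for my process. You can change the names, and of course the objects, but I created textChunk to hold just what I wanted rather than keeping a bunch of renderInfo objects around.

You can use this by adding a few lines that pull the data from your PDF.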

PdfDocument reader = new PdfDocument(new PdfReader(filepath));
FilteredEventListener listener = new FilteredEventListener();
var strat = listener.AttachEventListener(new TextLocationStrategy());
PdfCanvasProcessor processor = new PdfCanvasProcessor(listener);
processor.ProcessPageContent(reader.GetPage(1));

Once you have got this far, you can pull objectResult out of strat, either by making it public or by adding a method to the class that returns objectResult or does something with it.
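Since the question asks for the y coordinate of a word rather than of single characters, the per-character results still have to be grouped. Below is a rough Java sketch of one way to do that. CharBox, WordBox and WordFinder are hypothetical names, the fields mirror the textChunk properties above, and the boxes are assumed to be in reading order already, with spaces excluded as in the strategy above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Java counterpart of textChunk, reduced to the fields used here.
class CharBox {
    final String text;
    final float x, y, width, spaceWidth;

    CharBox(String text, float x, float y, float width, float spaceWidth) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.spaceWidth = spaceWidth;
    }
}

// A word together with the y coordinate of its baseline.
class WordBox {
    final String text;
    final float y;

    WordBox(String text, float y) {
        this.text = text;
        this.y = y;
    }
}

class WordFinder {
    // Groups characters into words. A new word starts when the baseline
    // changes or when the gap to the previous character exceeds its
    // spaceWidth threshold.
    static List<WordBox> words(List<CharBox> boxes) {
        List<WordBox> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        CharBox previous = null;
        for (CharBox box : boxes) {
            boolean newWord = previous != null
                    && (box.y != previous.y
                        || box.x - (previous.x + previous.width) > previous.spaceWidth);
            if (newWord) {
                result.add(new WordBox(current.toString(), previous.y));
                current.setLength(0);
            }
            current.append(box.text);
            previous = box;
        }
        if (previous != null) {
            result.add(new WordBox(current.toString(), previous.y));
        }
        return result;
    }

    // Returns the y coordinate of every occurrence of the given word.
    static List<Float> yOf(String word, List<CharBox> boxes) {
        List<Float> ys = new ArrayList<>();
        for (WordBox w : words(boxes)) {
            if (w.text.equals(word)) {
                ys.add(w.y);
            }
        }
        return ys;
    }
}
```

The threshold plays the same role as the spaceWidth property above, which stores half the width of a single space; whether that value suits a given document is something to tune.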

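Votes: 10

Stack Overflow user

Posted 2017-05-04 23:26:08

@Joris' answer explains how to implement a completely new extraction strategy / event listener for the task. Alternatively, you can try to tweak an existing text extraction strategy to do what you need.

This answer demonstrates how to tweak the existing LocationTextExtractionStrategy so that it returns both the text and the respective y coordinates of its characters.

Beware, this is only a proof of concept. In particular, it assumes the text is written horizontally, i.e. with an effective transformation matrix (ctm and text matrix combined) in which b and c equal 0. Furthermore, the character and coordinate retrieval methods of TextPlusY are not optimized at all and may take a long time to execute.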
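To make the "b and c equal 0" condition concrete, here is a small standalone sketch. It is not part of the answer's code: HorizontalTextCheck is a hypothetical helper that represents PDF matrices as six-element arrays [a, b, c, d, e, f] and multiplies them with the row-vector convention PDF uses.

```java
class HorizontalTextCheck {
    // Multiplies two PDF matrices [a, b, c, d, e, f] (row-vector convention),
    // i.e. applies m1 first and m2 second.
    static double[] multiply(double[] m1, double[] m2) {
        return new double[] {
            m1[0] * m2[0] + m1[1] * m2[2],
            m1[0] * m2[1] + m1[1] * m2[3],
            m1[2] * m2[0] + m1[3] * m2[2],
            m1[2] * m2[1] + m1[3] * m2[3],
            m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
            m1[4] * m2[1] + m1[5] * m2[3] + m2[5]
        };
    }

    // True if the combined matrix has b and c (approximately) equal to 0,
    // which is the horizontal-text assumption made by the proof of concept.
    static boolean isHorizontal(double[] textMatrix, double[] ctm) {
        double[] m = multiply(textMatrix, ctm);
        return Math.abs(m[1]) < 1e-6 && Math.abs(m[2]) < 1e-6;
    }
}
```

With a text matrix of [12, 0, 0, 12, 72, 700] and an identity ctm the check passes; with a ctm of [0, 1, -1, 0, 0, 0], a 90 degree rotation, it fails, and a single y coordinate per chunk would no longer describe the text line.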

As the OP did not express a language preference, here is a solution for iText 7 for Java:

TextPlusY

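For the task at hand, one needs to be able to retrieve characters and y coordinates side by side. To keep this simple, I use a class that represents both the text and the respective y coordinates of its characters. It is derived from CharSequence, a generalization of String, which allows it to be used in many String-related functions: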

import java.util.ArrayList;
import java.util.List;

public class TextPlusY implements CharSequence
{
    final List<String> texts = new ArrayList<>();
    final List<Float> yCoords = new ArrayList<>();

    //
    // CharSequence implementation
    //
    @Override
    public int length()
    {
        int length = 0;
        for (String text : texts)
        {
            length += text.length();
        }
        return length;
    }

    @Override
    public char charAt(int index)
    {
        for (String text : texts)
        {
            if (index < text.length())
            {
                return text.charAt(index);
            }
            index -= text.length();
        }
        throw new IndexOutOfBoundsException();
    }

    @Override
    public CharSequence subSequence(int start, int end)
    {
        TextPlusY result = new TextPlusY();
        int length = end - start;
        for (int i = 0; i < yCoords.size(); i++)
        {
            String text = texts.get(i);
            if (start < text.length())
            {
                float yCoord = yCoords.get(i); 
                if (start > 0)
                {
                    text = text.substring(start);
                    start = 0;
                }
                if (length > text.length())
                {
                    result.add(text, yCoord);
                    // the remaining length shrinks by what was just copied
                    length -= text.length();
                }
                else
                {
                    result.add(text.substring(0, length), yCoord);
                    break;
                }
            }
            else
            {
                start -= text.length();
            }
        }
        return result;
    }

    //
    // Object overrides
    //
    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder();
        for (String text : texts)
        {
            builder.append(text);
        }
        return builder.toString();
    }

    //
    // y coordinate support
    //
    public TextPlusY add(String text, float y)
    {
        if (text != null)
        {
            texts.add(text);
            yCoords.add(y);
        }
        return this;
    }

    public float yCoordAt(int index)
    {
        for (int i = 0; i < yCoords.size(); i++)
        {
            String text = texts.get(i);
            if (index < text.length())
            {
                return yCoords.get(i);
            }
            index -= text.length();
        }
        throw new IndexOutOfBoundsException();
    }
}


TextPlusYExtractionStrategy

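Now we extend LocationTextExtractionStrategy to extract a TextPlusY instead of a String. All we need to do for that is generalize the getResultantText method.

Unfortunately, LocationTextExtractionStrategy hides some methods and members (private or package protected) that need to be accessed here, so some reflection magic is required. If your framework does not allow this, you will have to copy the whole strategy and modify the copy accordingly.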

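// JDK imports only; the iText imports are omitted because their packages depend on the iText 7 version in use.
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class TextPlusYExtractionStrategy extends LocationTextExtractionStrategy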
{
    static Field locationalResultField;
    static Method sortWithMarksMethod;
    static Method startsWithSpaceMethod;
    static Method endsWithSpaceMethod;

    static Method textChunkSameLineMethod;

    static
    {
        try
        {
            locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            locationalResultField.setAccessible(true);
            sortWithMarksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("sortWithMarks", List.class);
            sortWithMarksMethod.setAccessible(true);
            startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace", String.class);
            startsWithSpaceMethod.setAccessible(true);
            endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
            endsWithSpaceMethod.setAccessible(true);

            textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
            textChunkSameLineMethod.setAccessible(true);
        }
        catch(NoSuchFieldException | NoSuchMethodException | SecurityException e)
        {
            // Reflection failed: fail fast instead of hitting a null member later
            throw new ExceptionInInitializerError(e);
        }
    }

    //
    // constructors
    //
    public TextPlusYExtractionStrategy()
    {
        super();
    }

    public TextPlusYExtractionStrategy(ITextChunkLocationStrategy strat)
    {
        super(strat);
    }

    @Override
    public String getResultantText()
    {
        return getResultantTextPlusY().toString();
    }

    public TextPlusY getResultantTextPlusY()
    {
        try
        {
            List<TextChunk> textChunks = new ArrayList<>((List<TextChunk>)locationalResultField.get(this));
            sortWithMarksMethod.invoke(this, textChunks);

            TextPlusY textPlusY = new TextPlusY();
            TextChunk lastChunk = null;
            for (TextChunk chunk : textChunks)
            {
                float chunkY = chunk.getLocation().getStartLocation().get(Vector.I2);
                if (lastChunk == null)
                {
                    textPlusY.add(chunk.getText(), chunkY);
                }
                else if ((Boolean)textChunkSameLineMethod.invoke(chunk, lastChunk))
                {
                    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                    if (isChunkAtWordBoundary(chunk, lastChunk) &&
                            !(Boolean)startsWithSpaceMethod.invoke(this, chunk.getText()) &&
                            !(Boolean)endsWithSpaceMethod.invoke(this, lastChunk.getText()))
                    {
                        textPlusY.add(" ", chunkY);
                    }

                    textPlusY.add(chunk.getText(), chunkY);
                }
                else
                {
                    textPlusY.add("\n", lastChunk.getLocation().getStartLocation().get(Vector.I2));
                    textPlusY.add(chunk.getText(), chunkY);
                }
                lastChunk = chunk;
            }

            return textPlusY;
        }
        catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e)
        {
            throw new RuntimeException("Reflection failed", e);
        }
    }
}


Usage

Using these two classes, you can extract text with coordinates and search in it like this:

try (   PdfReader reader = new PdfReader(YOUR_PDF);
        PdfDocument document = new PdfDocument(reader)  )
{
    TextPlusYExtractionStrategy extractionStrategy = new TextPlusYExtractionStrategy();
    PdfPage page = document.getFirstPage();

    PdfCanvasProcessor parser = new PdfCanvasProcessor(extractionStrategy);
    parser.processPageContent(page);
    TextPlusY textPlusY = extractionStrategy.getResultantTextPlusY();

    System.out.printf("\nText from test.pdf\n=====\n%s\n=====\n", textPlusY);

    System.out.print("\nText with y from test.pdf\n=====\n");
    
    int length = textPlusY.length();
    float lastY = Float.MIN_NORMAL;
    for (int i = 0; i < length; i++)
    {
        float y = textPlusY.yCoordAt(i);
        if (y != lastY)
        {
            System.out.printf("\n(%4.1f) ", y);
            lastY = y;
        }
        System.out.print(textPlusY.charAt(i));
    }
    System.out.print("\n=====\n");

    System.out.print("\nMatches of 'est' with y from test.pdf\n=====\n");
    Matcher matcher = Pattern.compile("est").matcher(textPlusY);
    while (matcher.find())
    {
        System.out.printf("from character %s to %s at y position (%4.1f)\n", matcher.start(), matcher.end(), textPlusY.yCoordAt(matcher.start()));
    }
    System.out.print("\n=====\n");
}

(Test method testExtractTextPlusYFromTest)

For my test document, the output of the test code above is:

Text from test.pdf
=====
Ein Dokumen t mit einigen
T estdaten
T esttest T est test test
=====

Text with y from test.pdf
=====

(691,8) Ein Dokumen t mit einigen

(666,9) T estdaten

(642,0) T esttest T est test test
=====

Matches of 'est' with y from test.pdf
=====
from character 28 to 31 at y position (666,9)
from character 39 to 42 at y position (642,0)
from character 43 to 46 at y position (642,0)
from character 49 to 52 at y position (642,0)
from character 54 to 57 at y position (642,0)
from character 59 to 62 at y position (642,0)

=====

My locale uses the comma as the decimal separator, so you might see 666.9 where my output shows 666,9.
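If the output should not depend on the machine's locale, the locale can be passed to the formatter explicitly. A minimal sketch, with YFormat as a hypothetical helper name and the same "%4.1f" pattern as the test code:

```java
import java.util.Locale;

class YFormat {
    // Formats a y coordinate with one decimal place using the given locale.
    static String format(float y, Locale locale) {
        return String.format(locale, "%4.1f", y);
    }
}
```

YFormat.format(666.9f, Locale.GERMANY) gives "666,9", while YFormat.format(666.9f, Locale.ROOT) gives "666.9". PrintStream.printf has an overload that takes a Locale as its first argument, so the System.out.printf calls above could be adjusted in the same way.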

The extra spaces you see can be removed by further fine-tuning the base LocationTextExtractionStrategy functionality. But that is the focus of other questions...

Votes: 5

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/43746884
