文章/答案/技术大牛

发布

社区首页 >问答首页 >PDFBox :在提取文本时维护PDF结构

问PDFBox :在提取文本时维护PDF结构
EN

Stack Overflow用户

提问于 2017-08-23 13:16:20

回答 1查看 4.7K关注 0票数 3

我正在尝试从PDF中提取文本，PDF中充满了表格。在某些情况下，列是空的。当我从PDF中提取文本时，emptys列会被跳过并替换为空格，因此，我的正则表达式无法确定在这个位置有一个没有信息的列。

更好地理解图像：

我们可以看到，在提取的文本中，列并不受尊重。

从PDF中提取文本的代码示例：

PDFTextStripper reader = new PDFTextStripper();
            reader.setSortByPosition(true);
            reader.setStartPage(page);
            reader.setEndPage(page);
            String st = reader.getText(document);
            List<String> lines = Arrays.asList(st.split(System.getProperty("line.separator")));

从原始PDF中提取文本时，如何保持其完整的结构？

非常感谢。

java

pdfbox

回答 1

Stack Overflow用户

回答已采纳

发布于 2017-08-23 14:27:12

(这最初是https://stackoverflow.com/a/28370692/1729265，OP删除了它，包括所有答案。由于时间的关系，答案中的代码仍然是基于PDFBox 1.8.x的，因此可能需要进行一些更改，以使其在PDFBox 2.0.x中运行。)

在评论中，OP显示出对扩展PDFBox PDFTextStripper以返回文本行的解决方案感兴趣，这些文本行试图反映PDF文件布局，这可能有助于解决手头的问题。

这方面的概念证明是这样的：

public class LayoutTextStripper extends PDFTextStripper
{
    public LayoutTextStripper() throws IOException
    {
        super();
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        super.startPage(page);
        cropBox = page.findCropBox();
        pageLeft = cropBox.getLowerLeftX();
        beginLine();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        float recentEnd = 0;
        for (TextPosition textPosition: textPositions)
        {
            String textHere = textPosition.getCharacter();
            if (textHere.trim().length() == 0)
                continue;

            float start = textPosition.getTextPos().getXPosition();
            boolean spacePresent = endsWithWS | textHere.startsWith(" ");

            if (needsWS | spacePresent | Math.abs(start - recentEnd) > 1)
            {
                int spacesToInsert = insertSpaces(chars, start, needsWS & !spacePresent);

                for (; spacesToInsert > 0; spacesToInsert--)
                {
                    writeString(" ");
                    chars++;
                }
            }

            writeString(textHere);
            chars += textHere.length();

            needsWS = false;
            endsWithWS = textHere.endsWith(" ");
            try
            {
                recentEnd = getEndX(textPosition);
            }
            catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException e)
            {
                throw new IOException("Failure retrieving endX of TextPosition", e);
            }
        }
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        super.writeLineSeparator();
        beginLine();
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
        needsWS = true;
    }

    void beginLine()
    {
        endsWithWS = true;
        needsWS = false;
        chars = 0;
    }

    int insertSpaces(int charsInLineAlready, float chunkStart, boolean spaceRequired)
    {
        int indexNow = charsInLineAlready;
        int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;

        return spacesToInsert;
    }

    float getEndX(TextPosition textPosition) throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException, SecurityException
    {
        Field field = textPosition.getClass().getDeclaredField("endX");
        field.setAccessible(true);
        return field.getFloat(textPosition);
    }

    public float fixedCharWidth = 3;

    boolean endsWithWS = true;
    boolean needsWS = false;
    int chars = 0;

    PDRectangle cropBox = null;
    float pageLeft = 0;
}

它是这样使用的：

PDDocument document = PDDocument.load(PDF);

LayoutTextStripper stripper = new LayoutTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth = charWidth; // e.g. 5

String text = stripper.getText(document);

fixedCharWidth是假定的字符宽度。根据所述PDF中的文字，不同的值可能更合适。在我的样本文档中，3.6的值是值得关注的。

它实质上模拟了iText在this answer中的类似解决方案。但是，结果有所不同，因为iText文本提取转发文本块，PDFBox文本提取转发单个字符。

请注意，这只是一个概念的证明.它特别没有考虑到任何旋转。

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/45840842

复制

相似问题

问PDFBox :在提取文本时维护PDF结构
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问PDFBox :在提取文本时维护PDF结构EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问PDFBox :在提取文本时维护PDF结构
EN