A Study of iText 7 Essentials (PDF Editing)

Author: 老梁
Published: 2019-09-10 17:33:04

Extracting text from a PDF document

String sourceFolder2 = "E:\\picture2\\租赁合同2.pdf";
PdfDocument doc = new PdfDocument(new PdfReader(sourceFolder2));
float height = doc.getPage(1).getPageSize().getHeight();
float width = doc.getPage(1).getPageSize().getWidth();
Rectangle rect = new Rectangle(width,height);
FilteredTextEventListener filterListener = new FilteredTextEventListener(new LocationTextExtractionStrategy(), new TextRegionEventFilter(rect));
String extractedText = PdfTextExtractor.getTextFromPage(doc.getPage(1), filterListener);
System.out.println(extractedText);
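  • The example above extracts all the text on the first page, because the rectangle covers the whole page. To extract only certain text, you need to know where it sits on the page; pass a rectangle covering that area and only the text inside it is returned.
  • The code above depends on com.itextpdf.kernel.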
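A note on coordinates may help when choosing that rectangle. iText's Rectangle(x, y, width, height) takes the lower-left corner plus a size, and default PDF user space puts the origin at the bottom-left of the page with y growing upward. The sketch below is plain Java with a hypothetical stand-in class Region (not an iText class). It shows how a box measured from the top-left corner, as most image viewers report it, might be converted. The page height of 842 (A4 in points) is only an illustrative assumption, and pages with rotation or a shifted media box would need more care.

```java
class RegionSketch {

    /** Hypothetical stand-in for an (x, y, width, height) rectangle; not an iText class. */
    static final class Region {
        final float x, y, width, height;

        Region(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** True if the point lies inside the region, edges included. */
        boolean contains(float px, float py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    /**
     * Converts a box measured from the page's top-left corner (y growing downward)
     * into bottom-left-origin coordinates (y growing upward).
     */
    static Region fromTopLeft(float pageHeight, float left, float top, float width, float height) {
        return new Region(left, pageHeight - top - height, width, height);
    }

    public static void main(String[] args) {
        // A 200 x 20 box whose top edge is 50 points below the top of an 842-point-high page.
        Region r = fromTopLeft(842f, 100f, 50f, 200f, 20f);
        System.out.println(r.x + ", " + r.y + ", " + r.width + ", " + r.height);
    }
}
```

The four numbers printed are in the same order as the arguments of the four-argument Rectangle constructor used in the snippets in this post.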
  1. Extracting text from several regions
@Test
public void testWithMultiFilteredRenderListener() throws IOException {
    PdfDocument pdfDocument = new PdfDocument(new PdfReader(sourceFolder + "test.pdf"));

    float x1, y1, x2, y2;

    FilteredEventListener listener = new FilteredEventListener();
    x1 = 122;
    x2 = 22;
    y1 = 678.9f;
    y2 = 12;
    ITextExtractionStrategy region1Listener = listener.attachEventListener(new LocationTextExtractionStrategy(),
            new TextRegionEventFilter(new Rectangle(x1, y1, x2, y2)));

    x1 = 156;
    x2 = 13;
    y1 = 678.9f;
    y2 = 12;
    ITextExtractionStrategy region2Listener = listener.attachEventListener(new LocationTextExtractionStrategy(),
            new TextRegionEventFilter(new Rectangle(x1, y1, x2, y2)));

    PdfCanvasProcessor parser = new PdfCanvasProcessor(new GlyphEventListener(listener));
    parser.processPageContent(pdfDocument.getPage(1));

    Assert.assertEquals("Your", region1Listener.getResultantText());
    Assert.assertEquals("dju", region2Listener.getResultantText());
}
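  1. Iterating over every character in a PDF
    • I used to think a listener could only walk the PDF text one chunk at a time; it turns out iText also exposes the individual characters.
    • Two listeners follow: one receives each text chunk, the other receives every character.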

    static class MyEventListener implements IEventListener {
        private List<Rectangle> rectangles = new ArrayList<>();

        @Override
        public void eventOccurred(IEventData data, EventType type) {
            if (type == EventType.RENDER_TEXT) {
                TextRenderInfo renderInfo = (TextRenderInfo) data;
                Vector startPoint = renderInfo.getDescentLine().getStartPoint();
                Vector endPoint = renderInfo.getAscentLine().getEndPoint();
                float x1 = Math.min(startPoint.get(0), endPoint.get(0));
                float x2 = Math.max(startPoint.get(0), endPoint.get(0));
                float y1 = Math.min(startPoint.get(1), endPoint.get(1));
                float y2 = Math.max(startPoint.get(1), endPoint.get(1));
                rectangles.add(new Rectangle(x1, y1, x2 - x1, y2 - y1));
            }
        }

        @Override
        public Set<EventType> getSupportedEvents() {
            return new LinkedHashSet<>(Collections.singletonList(EventType.RENDER_TEXT));
        }

        public List<Rectangle> getRectangles() {
            return rectangles;
        }

        public void clear() {
            rectangles.clear();
        }
    }

    static class MyCharacterEventListener extends MyEventListener {
        @Override
        public void eventOccurred(IEventData data, EventType type) {
            if (type == EventType.RENDER_TEXT) {
                TextRenderInfo renderInfo = (TextRenderInfo) data;
                for (TextRenderInfo tri : renderInfo.getCharacterRenderInfos()) {
                    super.eventOccurred(tri, type);
                }
            }
        }
    }
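The arithmetic inside MyEventListener is a normalised bounding box built from two opposite corners: the start of the descent line and the end of the ascent line. Isolated from iText, with plain float arrays standing in for Vector (an assumption made only so the sketch runs without the library), it might look like this:

```java
class BoundingBoxSketch {

    /**
     * Returns {x, y, width, height} for the axis-aligned box spanned by two
     * opposite corners, each given as {x, y}. The corners may come in any order.
     */
    static float[] boxFromCorners(float[] a, float[] b) {
        float x1 = Math.min(a[0], b[0]);
        float x2 = Math.max(a[0], b[0]);
        float y1 = Math.min(a[1], b[1]);
        float y2 = Math.max(a[1], b[1]);
        return new float[] {x1, y1, x2 - x1, y2 - y1};
    }

    public static void main(String[] args) {
        // Corners deliberately passed "backwards" to show that the order does not matter.
        float[] box = boxFromCorners(new float[] {116f, 311.5f}, new float[] {88f, 297.5f});
        System.out.println(box[0] + ", " + box[1] + ", " + box[2] + ", " + box[3]);
    }
}
```

Taking the min and max keeps the width and height non-negative whichever corner comes first. The resulting box is only a tight fit for unrotated text; for rotated text it merely spans the two corners.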

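    • Outlining every character: with this hook available, you can use your imagination and do much more. The method below draws a red box around each rectangle the listener collects.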

    private void parseAndHighlight(String input, String output, boolean singleCharacters) throws IOException {
        PdfDocument pdfDocument = new PdfDocument(new PdfReader(input), new PdfWriter(output));
        MyEventListener myEventListener = singleCharacters ? new MyCharacterEventListener() : new MyEventListener();
        PdfDocumentContentParser parser = new PdfDocumentContentParser(pdfDocument);
        for (int pageNum = 1; pageNum <= pdfDocument.getNumberOfPages(); pageNum++) {
            parser.processContent(pageNum, myEventListener);
            List<Rectangle> rectangles = myEventListener.getRectangles();
            PdfCanvas canvas = new PdfCanvas(pdfDocument.getPage(pageNum));
            canvas.setLineWidth(0.5f);
            canvas.setStrokeColor(ColorConstants.RED);
            for (Rectangle rectangle : rectangles) {
                canvas.rectangle(rectangle);
                canvas.stroke();
            }
            myEventListener.clear();
        }
        pdfDocument.close();
    }

  • To get this effect, simply call the method above:

@Test
public void highlightNotDefTest() throws IOException, InterruptedException {
    String input = sourceFolder + "page229.pdf";
    String output = outputPath + "page229.pdf";
    // false: one box per text chunk (word or phrase); true: one box per character
    parseAndHighlight(input, output, false);
}

  • Result with false: one box per text chunk
  1. Locating specific words
@Test
public void findPosition() throws Exception {
    String sourceFolder2 = "E:\\picture2\\租赁合同2.pdf";
    String output = "E:\\picture2\\租赁合同2_stroke.pdf";
    PdfReader reader = new PdfReader(sourceFolder2);
    PdfDocument pdfDocument = new PdfDocument(reader, new PdfWriter(output));
    PdfPage lastPage = pdfDocument.getLastPage();
    RegexBasedLocationExtractionStrategy strategy = new RegexBasedLocationExtractionStrategy("甲方");
    PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(strategy);
    canvasProcessor.processPageContent(lastPage);
    Collection<IPdfTextLocation> resultantLocations = strategy.getResultantLocations();
    PdfCanvas pdfCanvas = new PdfCanvas(lastPage);
    pdfCanvas.setLineWidth(0.5f);
    List<IPdfTextLocation> sets = new ArrayList<>();
    for (IPdfTextLocation location : resultantLocations) {
        Rectangle rectangle = location.getRectangle();
        pdfCanvas.rectangle(rectangle);
        pdfCanvas.setStrokeColor(ColorConstants.RED);
        pdfCanvas.stroke();
        System.out.println(rectangle.getX() + "," + rectangle.getY() + "," + rectangle.getLeft() + "," +
                rectangle.getRight() + "," + rectangle.getTop() + "," + rectangle.getBottom() + "," +
                rectangle.getWidth() + "," + rectangle.getHeight());
        System.out.println(location.getText());
        sets.add(location);
    }
    Collections.sort(sets, new Comparator<IPdfTextLocation>() {
        @Override
        public int compare(IPdfTextLocation o1, IPdfTextLocation o2) {
            return Float.compare(o1.getRectangle().getY(), o2.getRectangle().getY());
        }
    });
    System.out.println(sets.get(0).getRectangle().getY());
    pdfDocument.close();
}
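  • The output is shown below. Each numeric line lists x, y, left, right, top, bottom, width and height, followed by the matched text; the final line is the smallest y after sorting.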
88.0,297.53,88.0,115.72,311.53,297.53,27.720001,14.0
甲方
213.0,674.176,213.0,241.0,688.176,674.176,28.0,14.0
甲方
227.75,767.7765,227.75,254.75,781.2765,767.7765,27.0,13.5
甲方
322.25,767.7765,322.25,349.25,781.2765,767.7765,27.0,13.5
甲方
297.53
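  • This approach is used to position the signature seal on a contract. Sorting the matches by y in ascending order puts the lowest occurrence of the keyword on the page first, which is typically the last one in reading order.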
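The selection step can be tried without iText. The sketch below is plain Java with a hypothetical stand-in class Match (not IPdfTextLocation), fed with the four rectangles from the output above. It picks the lowest match directly instead of sorting the whole list, and breaks ties on y by taking the leftmost match; that tie-break is an assumption of this sketch, since the original comparator leaves the order of equal-y matches to the sort.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class LowestMatchSketch {

    /** Hypothetical stand-in for a located match; not an iText class. */
    static final class Match {
        final String text;
        final float x, y, width, height;

        Match(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Orders matches bottom-up (smallest y first), then left to right. */
    static final Comparator<Match> BOTTOM_UP =
            Comparator.comparingDouble((Match m) -> m.y).thenComparingDouble(m -> m.x);

    /** Returns the lowest match on the page; throws NoSuchElementException if the list is empty. */
    static Match lowest(List<Match> matches) {
        return Collections.min(matches, BOTTOM_UP);
    }

    public static void main(String[] args) {
        // x, y, width and height copied from the output listed above.
        List<Match> matches = Arrays.asList(
                new Match("甲方", 88.0f, 297.53f, 27.720001f, 14.0f),
                new Match("甲方", 213.0f, 674.176f, 28.0f, 14.0f),
                new Match("甲方", 227.75f, 767.7765f, 27.0f, 13.5f),
                new Match("甲方", 322.25f, 767.7765f, 27.0f, 13.5f));
        Match target = lowest(matches);
        System.out.println(target.x + "," + target.y);
    }
}
```

Using Collections.min avoids mutating the list, and the empty case fails loudly rather than with an index error, which matters when the keyword does not appear on the page.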

Adding text and images

@Test
public void imagesWithDifferentDepth() throws IOException, InterruptedException {
    String outFileName = destinationFolder + "transparencyTest01.pdf";
    String cmpFileName = sourceFolder + "cmp_transparencyTest01.pdf";
    PdfDocument pdfDocument = new PdfDocument(new PdfWriter(outFileName, new WriterProperties()
            .setCompressionLevel(CompressionConstants.NO_COMPRESSION)));
    PdfPage page = pdfDocument.addNewPage(PageSize.A3); // addNewPage() without an argument adds an A4 page
    PdfCanvas canvas = new PdfCanvas(page);
    canvas.setFillColor(ColorConstants.LIGHT_GRAY).fill(); // set the background fill color
    canvas.rectangle(80, 0, 700, 1200).fill();
    // Start adding text
    canvas
            .saveState()
            .beginText()
            .moveText(116, 1150) // where the text starts
            .setFontAndSize(PdfFontFactory.createFont(StandardFonts.HELVETICA), 14) // font and size
            .setFillColor(ColorConstants.MAGENTA) // text color
            .showText("8 bit depth PNG") // the text to display
            .endText()
            .restoreState();
    // Load the image and place it at the given position
    ImageData img = ImageDataFactory.create(sourceFolder + "manualTransparency_8bit.png");
    canvas.addImage(img, 100, 780, 200, false);

    // Clean up: release the canvas and close the document, otherwise the PDF will not open correctly
    canvas.release();
    pdfDocument.close();

}

Overwriting existing text

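  1. iText provides no API for replacing text in a PDF, so the only option is to paint over the old text and write the new text on top. Note that the snippet below uses the iText 5 API (PdfStamper, PdfContentByte, BaseColor) rather than iText 7, and that entrys, font, getFontSize() and replaceTextMap belong to a surrounding class that is not shown in this post.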
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
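        // getOverContent draws on top of the existing page content, so the rectangle hides the old text;
        // getUnderContent would draw beneath it and leave the old text visible
        PdfContentByte canvas = stamper.getOverContent(1);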
        canvas.saveState();
        canvas.setColorFill(BaseColor.YELLOW);
        canvas.rectangle(36, 786, 66, 16);
        canvas.fill();
        canvas.restoreState();
        
        // Start writing the replacement text
        canvas.beginText();
        for (Entry<String, ReplaceRegion> entry : entrys) {
            ReplaceRegion val = entry.getValue();
            // Set the font
            canvas.setFontAndSize(font.getBaseFont(), getFontSize());
            // The +2 adjusts the text position relative to its background
            canvas.setTextMatrix(val.getX(), val.getY() + 2);
            canvas.showText((String) replaceTextMap.get(val.getAliasName()));
        }
        canvas.endText();
        
        stamper.close();
        reader.close();
    }
Originally published 2018-11-23 on the author's personal site/blog.
