How to open and replace data in a PDFBox stream in Java (Apache PDFBox)?

Stack Overflow user
Asked 2015-09-16 19:43:22
1 answer, 5.9K views, 0 followers, score 1

I use Apache PDFBox 2.0.0 in my Java code (Java 1.6). I am struggling to work out how to get, replace, and save back into my PDF the data in

<stream> data here... <endstream> ?

My PDF file contains:

596 0 obj
<<
/Filter /FlateDecode
/Length 3739
>>
stream
xњ­[ЫnЬF}џoШ8эІАђhЮ/‰`@С%Hvќd-н“іXPJГ ...
endstream
endobj

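I found a way to decode this stream: I used the "WriteDecodedDoc" command from the pdfbox 1.8.10.jar API. So now I have two variants of the file, but I don't know how to work with this stream. The stream contains the footer and header, where images and text are placed.

I checked my file with the PDFTextStripper class. It can see the data I need in the stream, but I can't use this class to replace the data and save it back to the PDF file.

I also tried replacing the text by opening the file as plain text, searching for the text, replacing it only inside the stream, and saving. But then I get the error "Cannot extract the embedded font ...". The main reason is that I lose the encoding. I tried changing the encoding, but it did not help.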
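A likely reason the plain-text edit damages the file is that the original stream is zlib-compressed binary data (/Filter /FlateDecode; the leading `x` in the dump above is consistent with a zlib header byte 0x78). Reading such bytes through a text encoding and writing them back changes them, and the same applies to other binary streams in the file, such as embedded fonts. The following is a minimal sketch using only java.util.zip, not PDFBox. The class and method names (FlateSketch, deflate, inflate, survivesTextRoundTrip) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateSketch {
    // Compress bytes in zlib format, the format a /FlateDecode stream holds.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib data; damaged input is reported as IllegalArgumentException.
    static byte[] inflate(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or damaged stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // True if the bytes are unchanged after being read and written as UTF-8 text.
    static boolean survivesTextRoundTrip(byte[] bytes) {
        byte[] back = new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, back);
    }
}
```

Calling `survivesTextRoundTrip` on the output of `deflate` returns false, while the same call on the uncompressed ASCII operators returns true. Even when the bytes are preserved, a hand edit that changes a stream's size also invalidates its /Length entry and the byte offsets in the cross-reference table, which is why rewriting the stream through a PDF library is the safer route.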

By the way, I can't use iText. I have to use free libraries here.

Thanks for any solutions.

EDIT:

After decoding, my stream looks like this:

stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Span <</Lang (en-US)/MCID 83 >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
endstream

I need to replace a link inside the stream with a different link. This one:

[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
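To see what a string comparison has to match, it helps to know that the visible text of a TJ operand is the concatenation of its parenthesised strings; the numbers between them are only position adjustments. The following is a rough sketch using a regular expression on the decoded text, not PDFBox's parser. The names TjTextSketch and visibleText are hypothetical, and the sketch leaves backslash escapes undecoded and ignores hex strings, nested parentheses, and font encodings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TjTextSketch {
    // One literal string "( ... )"; a backslash may escape the following character.
    private static final Pattern LITERAL =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)");

    // Concatenate the string pieces of a TJ operand and skip the kerning numbers.
    static String visibleText(String tjOperand) {
        StringBuilder text = new StringBuilder();
        Matcher m = LITERAL.matcher(tjOperand);
        while (m.find()) {
            text.append(m.group(1));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String operand = "[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ";
        // prints www.linktothesiteshoouldbehere.com
        System.out.println(visibleText(operand));
    }
}
```

Any `equals` or `startsWith` test on the concatenated text has to use this exact string, including whatever the placeholder in the real file spells out.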

EDIT 2: Code

public static void replaceLinksInPdf(String filePath) {
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // COSBase cosb = document.getDocument().getObjects().get(27);
            // e.g. this object contains <stream> bytecode <endstream> in the PDF file.
            // it looks that
            // document -> getDocument() -> objectPool #27 -> baseObject -> randomAccess -> bufferList size 10 has a data that I can't open and work
            // document -> getDocument() -> objectPool #27 -> baseObject -> items -> all PDF's tag but NO a stream section

            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                List<Object> tokens = parser.getTokens();
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                // Tj takes a single string operand, so replace it with a COSString (not a COSArray)
                                newTokens.set(newTokens.size() - 1, new COSString("test2.test2.com"));
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.linkhouldbehere.com")) {
                                // some magic here to remove all indents and show new link from beginning.
                                // no rules. Just for test and it works here
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // number for padding of date from right place. Should be checked.
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    newTokens.add(token);
                }

                // save replaced content inside a page
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();
                page.setContents(newContents);

                // replace all links that have a pop-up line
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }
            // save file
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

EDIT 3:

The PDF contains object 660 0 obj, which holds the link I need:

660 0 obj
<<
/BBox [0.0 792.0 612.0 0.0]
/Length 792
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
>>
/Font <<
/T1_0 834 0 R
/T1_1 835 0 R
/T1_2 836 0 R
>>
/ProcSet [/PDF /Text]
>>
/Subtype /Form
>>
stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Artifact <</O /Layout >>BDC 
BT
/CS0 cs 0.412 0.416 0.423  scn
/T1_0 1 Tf
0 Tc 0 Tw 0 Ts 100 Tz 0 Tr 8 0 0 8 72 64.8 Tm
[(Visit )35(O)7(ur site R)23.1(esear)15.1(ch Manager )20.1(on )20(the )12(web at )]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.lin)-14.9(kshou)-10(ldbeh)-8(ere)-7.9(ninechars)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
EMC 
31.954 0 Td
[(A)15(ugust 7)45.1(,)-5( 2015)]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_0 1 Tf
8 0 0 8 540 64.8 Tm
( )Tj
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_2 1 Tf
7 0 0 7 72 55.3 Tm
[(\251 2015 )29(CCH Incorporated and its af\037liates. )38.3(All rights r)12(eserv)8.1(ed.)]TJ
ET
EMC 

endstream

I found only one place in the PDF file that references it. It is in object 45:

/XObject <<
    /Fm0 660 0 R
    /Fm1 661 0 R
>>

The full text of that object:

45 0 obj
<<
/ArtBox [0.0 0.0 612.0 792.0]
/BleedBox [0.0 0.0 612.0 792.0]
/Contents 658 0 R
/CropBox [0.0 0.0 612.0 792.0]
/Group 659 0 R
/MediaBox [0.0 0.0 612.0 792.0]
/Parent 13 0 R
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
/GS1 23 0 R
>>
/Font <<
/T1_0 597 0 R
/T1_1 26 0 R
/T1_2 28 0 R
/T1_3 25 0 R
>>
/ProcSet [/PDF /Text]
/XObject <<
/Fm0 660 0 R
/Fm1 661 0 R
>>
>>
/Rotate 0
/StructParents 22
/Tabs /W
/Thumb 662 0 R
/TrimBox [0.0 0.0 612.0 792.0]
/Type /Page
/Annots []
>>
endobj

So my question is: can I get this 660 0 obj and process it with PDFBox? It looks like PDFStreamParser knows nothing about this 660 0 object. Thanks.

1 Answer

Stack Overflow user

Accepted answer

Posted 2015-09-18 11:23:09

This is for PDFBox 2.0.0-SNAPSHOT. Here is my code; it works well for me for replacing links.

Many thanks to Tilman Hausherr for his help.

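// Imports used by the code below (package names as in the PDFBox 2.0.x releases)
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

String filePath = "d:\\pdf\\file1.pdf";

...

public static void replaceLinksInPdf(String filePath) {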
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            // Decrypt a document
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // replace all links in a footer and a header in XObjects with /ProcSet [/PDF /Text]
            // Note: these forms (and pattern objects too!) can have resources,
            // i.e. have Form XObjects or patterns again.
            // If so you need to use a recursion
            for (int pageNum = 0; pageNum < document.getPages().getCount(); pageNum++) {
                // Get all XObjects from the page
                Iterable<COSName> xobjs = document.getPage(pageNum).getResources().getXObjectNames();
                for (COSName xobj : xobjs) {
                    boolean isHasTextStream = false;
                    PDXObject pdxObject = document.getPage(pageNum).getResources().getXObject(xobj);
                    // A stream without a '/ProcSet [/PDF /Text]' entry is skipped.
                    // isXobjectHasTextFieldInPdf is recursive.
                    if (pdxObject.getCOSObject() instanceof COSDictionary) {
                        isHasTextStream = isXobjectHasTextFieldInPdf((COSDictionary) pdxObject.getCOSObject());
                    }

                    if (pdxObject instanceof PDFormXObject && isHasTextStream) {
                        // Use a fresh token list for each form XObject, so tokens from one form
                        // are not written into the next form on the same page
                        List<Object> newPdxTokens = new ArrayList<Object>();
                        // Get the stream from pdxObject
                        PDStream stream = pdxObject.getStream();
                        PDFStreamParser streamParser = new PDFStreamParser(stream.toByteArray());
                        streamParser.parse();
                        for (Object token : streamParser.getTokens()) {
                            if (token instanceof Operator) {
                                Operator op = (Operator) token;
                                if (op.getName().equals("Tj")) {
                                    // Tj contains 1 COSString
                                    COSString previous = (COSString) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = previous.getString();
                                    // here can be any filters for checking a necessary string
                                    if (string.equals("www.testlink.com")) {
                                        // Tj takes a single string operand, so replace it with a COSString (not a COSArray)
                                        newPdxTokens.set(newPdxTokens.size() - 1, new COSString("test.test.com"));
                                    }
                                } else if (op.getName().equals("TJ")) {
                                    // TJ contains a COSArray with COSStrings and COSFloat (padding)
                                    COSArray previous = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = "";
                                    for (int k = 0; k < previous.size(); k++) {
                                        Object arrElement = previous.getObject(k);
                                        if (arrElement instanceof COSString) {
                                            COSString cosString = (COSString) arrElement;
                                            String content = cosString.getString();
                                            string += content;
                                        }
                                    }
                                    // here can be any filters for checking a necessary string
                                    // check if string contains a necessary link
                                    if (string.equals("www.testlink.com")) {
                                        COSArray newLink = new COSArray();
                                        newLink.add(new COSString("test.test.com"));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    } else if (string.startsWith("www.testlink.com")) {
                                        // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                        COSArray newLink = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                        int size = newLink.size();
                                        float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                        for (int i = 0; i < size - 4; i++) {
                                            newLink.remove(0);
                                        }
                                        newLink.set(0, new COSString("test.test.com"));
                                        // number for indenting from right place. Should be checked.
                                        newLink.set(1, new COSFloat(f - 8000));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    }
                                }
                            }
                            // save tokens to a temporary List
                            newPdxTokens.add(token);
                        }
                        // save the replaced data back to the stream
                        OutputStream out = stream.createOutputStream();
                        ContentStreamWriter writer = new ContentStreamWriter(out);
                        writer.writeTokens(newPdxTokens);
                        out.close();
                    }
                }
            }

            // replace data from any text stream from pdf. XObjects not included.
            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                // Get all tokens from the page
                List<Object> tokens = parser.getTokens();
                // Create a temporary List
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                // Tj takes a single string operand, so replace it with a COSString (not a COSArray)
                                newTokens.set(newTokens.size() - 1, new COSString("test2.test2.com"));
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.testlink.com")) {
                                // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // number for padding from right place. Should be checked.
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    // save tokens to a temporary List
                    newTokens.add(token);
                }
                // save the replaced data back to the document's stream
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();

                // save content
                page.setContents(newContents);

                // replace all links that have a pop-up line (It does not affect the visible text)
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }

            // save document
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

An additional method that lets the code process only text streams and skip image streams. It is called from the main method "replaceLinksInPdf(String filePath)".

        // Check if COSDictionary has '/ProcSet [/PDF /Text]' string in the stream
        private static boolean isXobjectHasTextFieldInPdf(COSDictionary dictionary) {
            boolean isHasTextField = false;
            for (COSBase cosBase : dictionary.getValues()) {
                // go to a recursion because COSDictionary can have COSDictionaries inside
                if (cosBase instanceof COSDictionary) {
                    COSDictionary cosDictionaryNew = (COSDictionary) cosBase;
                    // check if '/ProcSet' has '/Text' param
                    if (cosDictionaryNew.containsKey(COSName.PROC_SET)) {
                        COSBase procSet = cosDictionaryNew.getDictionaryObject(COSName.PROC_SET);
                        if (procSet instanceof COSArray) {
                            for (COSBase procSetIterator : ((COSArray) procSet)) {
                                if (procSetIterator instanceof COSName
                                        && ((COSName) procSetIterator).getName().equals("Text")) {
                                    return true;
                                }
                            }
                        } else if (procSet instanceof COSString && ((COSString) procSet).getString().equals("Text")) {
                            return true;
                        }
                    }
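                    // go to the COSDictionary children; return as soon as one matches,
                    // so a later sibling dictionary cannot overwrite a positive result
                    if (isXobjectHasTextFieldInPdf(cosDictionaryNew)) {
                        return true;
                    }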
                }
            }
            return isHasTextField;
        }

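This is only a test variant for my project. I will refactor this code according to the project's rules. You should adapt the replacement logic to your own needs. Also, I have been using the PDFBox 2.0.0 library for only about a week, so someone may well find a simpler way to do parts of this. Feel free to review the code and post a better variant. Thanks.

I tested this on 40 PDFs; only 2 of them needed the deeper processing via recursion. All 40 files open, are readable, and look like the previous versions, except for the links.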
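On the `f - 8000` value used in the TJ branches: in the PDF text model, a number inside a TJ array is given in thousandths of a text space unit and is subtracted from the current position, so a large negative number pushes the following text to the right. The arithmetic can be sketched as below, assuming an unrotated text matrix and the default 100% horizontal scaling. The names TjShiftSketch and userSpaceShift are hypothetical, and the sample values are taken from the `/T1_1 1 Tf` and `8 0 0 8 ... Tm` lines in the question.

```java
class TjShiftSketch {
    // Horizontal shift caused by one number in a TJ array, in user space units.
    // tjNumber: the array element (thousandths of a text space unit, subtracted)
    // fontSize: the size operand of Tf
    // tmScaleX: the first component of the Tm text matrix
    static double userSpaceShift(double tjNumber, double fontSize, double tmScaleX) {
        double textSpaceShift = -tjNumber / 1000.0 * fontSize;
        return textSpaceShift * tmScaleX;
    }

    public static void main(String[] args) {
        // An element of -8000 with "1 Tf" and "8 0 0 8 ... Tm"
        System.out.println(userSpaceShift(-8000, 1, 8)); // prints 64.0
    }
}
```

Under these assumptions an element of -8000 moves the following text about 64 units to the right, which explains why the constant has to be checked against each document's font size and text matrix rather than reused as is.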

Score: 2

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/32617343
