How to open and replace data in a stream with Apache PDFBox in Java?

Stack Overflow user

Asked 2015-09-16 19:43:22

1 answer, 5.9K views, 0 followers, 1 vote

I use Apache PDFBox 2.0.0 in my Java code (Java 1.6). I am struggling to figure out how to get, replace, and save back into my PDF the data in

<stream> data here... <endstream> ?

My PDF file looks like this:

596 0 obj
<<
/Filter /FlateDecode
/Length 3739
>>
stream
xњ[ЫnЬF}џoШ8эІАђhЮ/‰`@С%Hvќd-н“іXPJГ ...
endstream
endobj

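I found a way to decode this stream: the "WriteDecodedDoc" command from pdfbox 1.8.10.jar. So now I have two variants of the file, but I don't know how to work with the stream. This stream contains the header and footer, where images and text are placed.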
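For readers who want to see what decoding a /FlateDecode stream amounts to, here is a minimal sketch using only the JDK (Java 8 or newer is assumed). /FlateDecode data is zlib-wrapped deflate, which java.util.zip can inflate directly. The class and method names (FlateSketch, inflate, deflate) are made up for this illustration, the input is compressed in memory instead of being copied out of a real PDF, and predictor parameters (/DecodeParms) are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class FlateSketch {
    /** Inflates zlib-wrapped deflate data, the format used by /FlateDecode. */
    static byte[] inflate(byte[] compressed) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses data the same way; stands in for bytes taken from a real PDF stream. */
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] content = "BT /T1_1 1 Tf (hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] decoded = inflate(deflate(content));
        // Prints the original content stream text again after the round trip.
        System.out.println(new String(decoded, StandardCharsets.ISO_8859_1));
    }
}
```

This only shows the compression layer. Writing an edited stream back into the file still needs a PDF library, because the stream length and the object offsets in the file change.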

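I checked my file with the PDFTextStripper class. It can see the necessary data in the stream, but I can't use this class to replace the data and save it back to the PDF file.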

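I tried to replace the text by simply opening the file as text, searching for the text, replacing it only inside the stream, and saving. But then I get the error "Cannot extract the embedded font". The main reason is that I lose the encoding. I tried changing the encoding, but that did not help.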
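A likely reason the text-editor approach damages the file is that a PDF is binary data: decoding it as text and encoding it again is not guaranteed to give back the same bytes. The sketch below (JDK only, Java 8 or newer assumed, with a made-up class name and made-up sample bytes) shows that a UTF-8 round trip alters arbitrary bytes while an ISO-8859-1 round trip preserves them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TextRoundTripSketch {
    /** Decodes bytes with the given charset and encodes them again, as a text editor would. */
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // Arbitrary binary bytes, standing in for a compressed stream or an embedded font.
        byte[] binary = {(byte) 0x78, (byte) 0x9C, (byte) 0xFF, (byte) 0x80, 0x41};

        boolean utf8Intact = Arrays.equals(binary, roundTrip(binary, StandardCharsets.UTF_8));
        boolean latin1Intact = Arrays.equals(binary, roundTrip(binary, StandardCharsets.ISO_8859_1));

        System.out.println("UTF-8 round trip intact: " + utf8Intact);        // false
        System.out.println("ISO-8859-1 round trip intact: " + latin1Intact); // true
    }
}
```

Even with a byte-preserving charset, a replacement of a different length invalidates the stream's /Length entry and the byte offsets stored in the cross-reference table, so editing through a PDF library is the safer route.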

By the way, I can't use iText. I have to use free libraries here.

Thanks for any solution.

EDIT:

After decoding, my stream looks like this:

stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Span <</Lang (en-US)/MCID 83 >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
endstream

I need to replace one link inside the stream with a different link. This one:

[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
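The TJ operand above is an array that mixes string pieces with kerning numbers, so the link never appears as one contiguous string in the stream. A minimal sketch of how the visible text can be reassembled for comparison, using only the JDK: the class name TjTextSketch and the regular expression are assumptions for this illustration, and the pattern only handles simple literal strings without escapes or nested parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TjTextSketch {
    // Matches a simple literal string "( ... )" without escapes or nested parentheses.
    private static final Pattern LITERAL = Pattern.compile("\\(([^()\\\\]*)\\)");

    /** Joins the string pieces of a TJ operand, ignoring the kerning numbers between them. */
    static String joinStrings(String tjOperand) {
        StringBuilder text = new StringBuilder();
        Matcher matcher = LITERAL.matcher(tjOperand);
        while (matcher.find()) {
            text.append(matcher.group(1));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String operand = "[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]";
        // Prints: www.linktothesiteshoouldbehere.com
        System.out.println(joinStrings(operand));
    }
}
```

Any string used in an equals or startsWith check has to match this joined text exactly. The pieces are also character codes in the font's encoding, so they are only readable like this when the font uses a standard-like encoding.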

EDIT 2: Code

public static void replaceLinksInPdf(String filePath) {
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // COSBase cosb = document.getDocument().getObjects().get(27);
            // e.g. this object contains <stream> bytecode <endstream> in the PDF file.
            // it looks that
            // document -> getDocument() -> objectPool #27 -> baseObject -> randomAccess -> bufferList size 10 has a data that I can't open and work
            // document -> getDocument() -> objectPool #27 -> baseObject -> items -> all PDF's tag but NO a stream section

            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                List<Object> tokens = parser.getTokens();
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                // Tj takes a single string operand, so replace it with a COSString, not an array
                                newTokens.set(newTokens.size() - 1, new COSString("test2.test2.com"));
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // check if string contains a necessary link
                            if (string.equals("www.linkhouldbehere.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.linkhouldbehere.com")) {
                                // some magic here to remove all indents and show new link from beginning.
                                // no rules. Just for test and it works here
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // number for padding of date from right place. Should be checked.
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    newTokens.add(token);
                }

                // save replaced content inside a page
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();
                page.setContents(newContents);

                // replace all links that have a pop-up line
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }
            // save file
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

EDIT 3.

The PDF contains 660 0 obj, which holds the link I need:

660 0 obj
<<
/BBox [0.0 792.0 612.0 0.0]
/Length 792
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
>>
/Font <<
/T1_0 834 0 R
/T1_1 835 0 R
/T1_2 836 0 R
>>
/ProcSet [/PDF /Text]
>>
/Subtype /Form
>>
stream
/CS0 CS 0.412 0.416 0.423  SCN
0.25 w 
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Artifact <</O /Layout >>BDC 
BT
/CS0 cs 0.412 0.416 0.423  scn
/T1_0 1 Tf
0 Tc 0 Tw 0 Ts 100 Tz 0 Tr 8 0 0 8 72 64.8 Tm
[(Visit )35(O)7(ur site R)23.1(esear)15.1(ch Manager )20.1(on )20(the )12(web at )]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.lin)-14.9(kshou)-10(ldbeh)-8(ere)-7.9(ninechars)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC 
( )Tj
EMC 
31.954 0 Td
[(A)15(ugust 7)45.1(,)-5( 2015)]TJ
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_0 1 Tf
8 0 0 8 540 64.8 Tm
( )Tj
ET
EMC 
/Artifact <</O /Layout >>BDC 
BT
/T1_2 1 Tf
7 0 0 7 72 55.3 Tm
[(\251 2015 )29(CCH Incorporated and its af\037liates. )38.3(All rights r)12(eserv)8.1(ed.)]TJ
ET
EMC 

endstream

I found only one place in the PDF file that references it. It is in 45 0 obj:

/XObject <<
    /Fm0 660 0 R
    /Fm1 661 0 R
>>

The full text of that obj:

45 0 obj
<<
/ArtBox [0.0 0.0 612.0 792.0]
/BleedBox [0.0 0.0 612.0 792.0]
/Contents 658 0 R
/CropBox [0.0 0.0 612.0 792.0]
/Group 659 0 R
/MediaBox [0.0 0.0 612.0 792.0]
/Parent 13 0 R
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
/GS1 23 0 R
>>
/Font <<
/T1_0 597 0 R
/T1_1 26 0 R
/T1_2 28 0 R
/T1_3 25 0 R
>>
/ProcSet [/PDF /Text]
/XObject <<
/Fm0 660 0 R
/Fm1 661 0 R
>>
>>
/Rotate 0
/StructParents 22
/Tabs /W
/Thumb 662 0 R
/TrimBox [0.0 0.0 612.0 792.0]
/Type /Page
/Annots []
>>
endobj

So the question is: can I get this 660 0 obj and process it with PDFBox? It looks like PDFStreamParser knows nothing about this 660 0 object. Thanks.
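One probable explanation: parsing a page tokenizes only the page's own content stream (658 0 R here). A form XObject such as 660 0 obj is merely invoked from that stream by name with the Do operator, and its operators live in a separate stream that is reached through the page's /XObject resources. The sketch below uses only the JDK to list which XObject names a decoded content stream invokes. The class name, the regular expression, and the sample content are hypothetical, since the real stream 658 0 R is not shown in the post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoOperatorSketch {
    // A name operand followed by the Do operator, e.g. "/Fm0 Do".
    private static final Pattern DO_CALL =
            Pattern.compile("/([^\\s/\\[\\]<>()]+)\\s+Do(?=\\s|$)");

    /** Lists the XObject resource names that a decoded content stream paints with Do. */
    static List<String> invokedXObjects(String decodedContent) {
        List<String> names = new ArrayList<>();
        Matcher matcher = DO_CALL.matcher(decodedContent);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical page content; the footer text itself is not in this stream.
        String pageContent = "q\n/Fm0 Do\nQ\nBT\n/T1_0 1 Tf\n(body text) Tj\nET\nq\n/Fm1 Do\nQ\n";
        // Prints: [Fm0, Fm1]
        System.out.println(invokedXObjects(pageContent));
    }
}
```

A plain text scan like this can be fooled by string operands that happen to contain "/x Do", so it is an inspection aid, not a parser. The names it finds are the keys to look up in the page's /XObject dictionary.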

Tags: java, pdf, stream, pdfbox

1 Answer

Stack Overflow user

Accepted answer

Answered 2015-09-18 11:23:09

This is for PDFBox 2.0.0-SNAPSHOT. Here is my code; it works well for me for replacing links.

Many thanks to Tilman Hausherr for his help.

String filePath = "d:\\pdf\\file1.pdf";

...

public static void replaceLinksInPdf(String filePath) {
        PDDocument document = null;
        try {
            document = PDDocument.load(new File(filePath));
            // Decrypt a document
            if (document.isEncrypted()) {
                document.setAllSecurityToBeRemoved(true);
                System.out.println(filePath + " Doc was decrypted");
            }

            // replace all links in a footer and a header in XObjects with /ProcSet [/PDF /Text]
            // Note: these forms (and pattern objects too!) can have resources,
            // i.e. have Form XObjects or patterns again.
            // If so you need to use a recursion
            for (int pageNum = 0; pageNum < document.getPages().getCount(); pageNum++) {
                List<Object> newPdxTokens = new ArrayList<Object>();
                // Get all XObjects from the page
                Iterable<COSName> xobjs = document.getPage(pageNum).getResources().getXObjectNames();
                for (COSName xobj : xobjs) {
                    boolean isHasTextStream = false;
                    PDXObject pdxObject = document.getPage(pageNum).getResources().getXObject(xobj);
                    // If a stream has not '/ProcSet [/PDF /Text]' line inside it has to be skipped
                    // isXobjectHasTextFieldInPdf has a recursion
                    if (pdxObject.getCOSObject() instanceof COSDictionary) {
                        isHasTextStream = isXobjectHasTextFieldInPdf((COSDictionary) pdxObject.getCOSObject());
                    }

                    if (pdxObject instanceof PDFormXObject && isHasTextStream) {
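                        // Start from an empty token list so tokens from a previous form are not written again
                        newPdxTokens.clear();
                        // Get the stream from the form XObject
                        PDStream stream = pdxObject.getStream();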
                        PDFStreamParser streamParser = new PDFStreamParser(stream.toByteArray());
                        streamParser.parse();
                        for (Object token : streamParser.getTokens()) {
                            if (token instanceof Operator) {
                                Operator op = (Operator) token;
                                if (op.getName().equals("Tj")) {
                                    // Tj contains 1 COSString
                                    COSString previous = (COSString) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = previous.getString();
                                    // here can be any filters for checking a necessary string
                                    if (string.equals("www.testlink.com")) {
                                        // Tj takes a single string operand, so replace it with a COSString, not an array
                                        newPdxTokens.set(newPdxTokens.size() - 1, new COSString("test.test.com"));
                                    }
                                } else if (op.getName().equals("TJ")) {
                                    // TJ contains a COSArray with COSStrings and COSFloat (padding)
                                    COSArray previous = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                    String string = "";
                                    for (int k = 0; k < previous.size(); k++) {
                                        Object arrElement = previous.getObject(k);
                                        if (arrElement instanceof COSString) {
                                            COSString cosString = (COSString) arrElement;
                                            String content = cosString.getString();
                                            string += content;
                                        }
                                    }
                                    // here can be any filters for checking a necessary string
                                    // check if string contains a necessary link
                                    if (string.equals("www.testlink.com")) {
                                        COSArray newLink = new COSArray();
                                        newLink.add(new COSString("test.test.com"));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    } else if (string.startsWith("www.testlink.com")) {
                                        // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                        COSArray newLink = (COSArray) newPdxTokens.get(newPdxTokens.size() - 1);
                                        int size = newLink.size();
                                        float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                        for (int i = 0; i < size - 4; i++) {
                                            newLink.remove(0);
                                        }
                                        newLink.set(0, new COSString("test.test.com"));
                                        // number for indenting from right place. Should be checked.
                                        newLink.set(1, new COSFloat(f - 8000));
                                        newPdxTokens.set(newPdxTokens.size() - 1, newLink);
                                    }
                                }
                            }
                            // save tokens to a temporary List
                            newPdxTokens.add(token);
                        }
                        // save the replaced data back to the stream
                        OutputStream out = stream.createOutputStream();
                        ContentStreamWriter writer = new ContentStreamWriter(out);
                        writer.writeTokens(newPdxTokens);
                        out.close();
                    }
                }
            }

            // replace data from any text stream from pdf. XObjects not included.
            int pageNum = 0;
            for (PDPage page : document.getPages()) {
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                // Get all tokens from the page
                List<Object> tokens = parser.getTokens();
                // Create a temporary List
                List<Object> newTokens = new ArrayList<Object>();

                for (Object token : tokens) {
                    if (token instanceof Operator) {
                        COSDictionary dictionary = ((Operator) token).getImageParameters();
                        if (dictionary != null) {
                            System.out.println(dictionary.toString());
                        }
                    }
                    if (token instanceof Operator) {
                        Operator op = (Operator) token;
                        if (op.getName().equals("Tj")) {
                            // Tj contains 1 COSString
                            COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                            String string = previous.getString();
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                // Tj takes a single string operand, so replace it with a COSString, not an array
                                newTokens.set(newTokens.size() - 1, new COSString("test2.test2.com"));
                            }
                        } else if (op.getName().equals("TJ")) {
                            // TJ contains a COSArray with COSStrings and COSFloat (padding)
                            COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                            String string = "";
                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement instanceof COSString) {
                                    COSString cosString = (COSString) arrElement;
                                    String content = cosString.getString();
                                    string += content;
                                }
                            }
                            // here can be any filters for checking a necessary string
                            if (string.equals("www.testlink.com")) {
                                COSArray newLink = new COSArray();
                                newLink.add(new COSString("test.test.com"));
                                newTokens.set(newTokens.size() - 1, newLink);
                            } else if (string.startsWith("www.testlink.com")) {
                                // this code should be changed. It can have some indenting problems that depend on COSFloat values
                                COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                                int size = newLink.size();
                                float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                                for (int i = 0; i < size - 4; i++) {
                                    newLink.remove(0);
                                }
                                newLink.set(0, new COSString("test.test.com"));
                                // number for padding from right place. Should be checked.
                                newLink.set(1, new COSFloat(f - 8000));
                                newTokens.set(newTokens.size() - 1, newLink);
                            }
                        }
                    }
                    // save tokens to a temporary List
                    newTokens.add(token);
                }
                // save the replaced data back to the document's stream
                PDStream newContents = new PDStream(document);
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter(out);
                writer.writeTokens(newTokens);
                out.close();

                // save content
                page.setContents(newContents);

                // replace all links that have a pop-up line (It does not affect the visible text)
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annotation : annotations) {
                    PDAnnotation annot = annotation;
                    if (annot instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        PDAction action = link.getAction();
                        if (action instanceof PDActionURI) {
                            PDActionURI uri = (PDActionURI) action;
                            String newURI = "www.test1.test1.com";
                            uri.setURI(newURI);
                        }
                    }
                }
            }

            // save document
            document.save(filePath.replace("file", "file_result"));
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

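An additional method makes sure that only text streams are processed and image streams are skipped. It is called from the main method "replaceLinksInPdf(String filePath)".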

        // Check if COSDictionary has '/ProcSet [/PDF /Text]' string in the stream
        private static boolean isXobjectHasTextFieldInPdf(COSDictionary dictionary) {
            boolean isHasTextField = false;
            for (COSBase cosBase : dictionary.getValues()) {
                // go to a recursion because COSDictionary can have COSDictionaries inside
                if (cosBase instanceof COSDictionary) {
                    COSDictionary cosDictionaryNew = (COSDictionary) cosBase;
                    // check if '/ProcSet' has '/Text' param
                    if (cosDictionaryNew.containsKey(COSName.PROC_SET)) {
                        COSBase procSet = cosDictionaryNew.getDictionaryObject(COSName.PROC_SET);
                        if (procSet instanceof COSArray) {
                            for (COSBase procSetIterator : ((COSArray) procSet)) {
                                if (procSetIterator instanceof COSName
                                        && ((COSName) procSetIterator).getName().equals("Text")) {
                                    return true;
                                }
                            }
                        } else if (procSet instanceof COSString && ((COSString) procSet).getString().equals("Text")) {
                            return true;
                        }
                    }
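                    // recurse into the child dictionary; stop at the first match so a
                    // later child without '/Text' cannot overwrite an earlier positive result
                    if (isXobjectHasTextFieldInPdf(cosDictionaryNew)) {
                        return true;
                    }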
                }
            }
            return isHasTextField;
        }

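This is only a test variant for my project. I will refactor this code according to the project's rules. You should adjust the replacements to your own needs. Also, I have been using the PDFBox 2.0.0 library for only about a week, so maybe someone can find a simpler way to do some of this. Feel free to review the code and post a better variant. Thanks.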

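I tested this on 40 PDFs; only 2 of them needed the deeper processing with recursion. All 40 files can be opened and read, and they look like the previous versions except for the links.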
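On the "f - 8000" value that the code comments flag as needing a check: numbers inside a TJ array are expressed in thousandths of a text-space unit and are subtracted from the current horizontal position, so a negative number moves the following text to the right. The sketch below works through that arithmetic with the values visible in the footer form ("/T1_1 1 Tf", "100 Tz", "8 0 0 8 237.0609 64.8 Tm"). The class and method names are made up, and an unrotated text matrix and an identity form matrix are assumed.

```java
final class TjAdjustmentSketch {
    /**
     * Horizontal shift in user-space units caused by one TJ number.
     * The number is in thousandths of a text-space unit and is subtracted
     * from the current position, so a negative value moves the next glyph right.
     */
    static double shift(double tjNumber, double fontSize,
                        double horizontalScalePercent, double textMatrixScaleX) {
        double textSpaceShift = -tjNumber / 1000.0 * fontSize * (horizontalScalePercent / 100.0);
        return textSpaceShift * textMatrixScaleX;
    }

    public static void main(String[] args) {
        // Font size 1, horizontal scaling 100 %, text matrix scale 8, as in the footer form.
        System.out.println(shift(-8000, 1, 100, 8)); // 64.0
        // A small positive kerning value pulls the next glyph slightly to the left.
        System.out.println(shift(15.1, 1, 100, 8));
    }
}
```

Under those assumptions, an adjustment of -8000 on its own pushes the text that follows 64 units to the right. A replacement link of a different width therefore needs a different value, ideally derived from the font's glyph widths instead of a constant.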

Votes: 2

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/32617343
