WaterMark Large Spool File

Help
Visitor
2013-04-02
2013-05-28
1 2 > >> (Page 1 of 2)
  • Visitor
    Visitor
    2013-04-02

    Hi all,
    I'm evaluating whether jPod can be used to add a watermark to each page of a big document:
    100,000 pages
    500 MB
    I would like to avoid loading the whole stream into memory.
    Can I do it with jPod?
    Thanks.
    Marco

     
  • Visitor
    Visitor
    2013-04-02

    To be specific, I can only use 256 MB of memory for the whole run.

     
  • mtraut
    mtraut
    2013-04-03

    In theory, yes.

    You can walk through the page tree, change each page, and SAVE (incremental). The COSIndirectObject should then be eligible for garbage collection, as it is held via a SoftReference only.

    As this is not a standard case in our usage scenarios, it MAY fail at first because of references that still hold the pages in memory, but it should not. As far as I remember, there have been issues in your usage regarding the font registry? Anyway, IF you have problems AND come up with issues that are tracked down to a garbage collection root for this scenario, I'm glad to fix this (better: try).
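
The point about SoftReference can be illustrated with plain JDK classes. This is only a minimal sketch and does not use jPod: `SoftCache` and its methods are hypothetical names, standing in for any registry that holds loaded objects softly. It shows that such a registry does not by itself keep an object alive; only other strong references do.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a registry that holds loaded objects only softly.
class SoftCache {
    private final Map<Integer, SoftReference<byte[]>> entries = new HashMap<>();

    // Store the payload behind a SoftReference; the cache never holds it strongly.
    void put(int key, byte[] payload) {
        entries.put(key, new SoftReference<>(payload));
    }

    // Returns the payload, or null if it was never stored or has been collected.
    byte[] get(int key) {
        SoftReference<byte[]> ref = entries.get(key);
        return ref == null ? null : ref.get();
    }
}

class SoftCacheDemo {
    public static void main(String[] args) {
        SoftCache cache = new SoftCache();
        byte[] page = new byte[1024];
        cache.put(1, page);
        // 'page' is still strongly reachable here, so the soft reference cannot be cleared yet.
        System.out.println(cache.get(1) == page); // prints true
        page = null;
        // From now on the JVM may reclaim the array under memory pressure,
        // so a later cache.get(1) may return null.
    }
}
```

In the watermarking scenario this means that any extra strong reference to a processed page (a list of pages, a field, a listener) would defeat the mechanism.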

     
  • Visitor
    Visitor
    2013-04-11

    Hi,
    I tried the following:

    https://friendpaste.com/3YzlS3bvTcXEPK3FZkl7nb

    with a 400 MB PDF,

    but CDSRectangle rect = pageTree.getPageAt(i).getCropBox();

    takes ages to execute, even on the first step, and already uses 800 MB of RAM.

    What did I miss?
    Thanks for your support.
    Marco

     
  • Visitor
    Visitor
    2013-04-12

    I rewrote it,
    but the problem persists.
    I can't find a way to iterate over the pages without reading all of them.
    Does one exist?

     
  • mtraut
    mtraut
    2013-04-12

    1) Looking at the jPod implementation should reveal very easily that PDPageTree.getPageAt() is quite inefficient to use in this case, as it is a "linked tree" structure, not direct access. It will start from page one every time. To navigate through pages use "next" and "previous".

    2) Then, instead of reusing the page you already have, you start the lookup all over again inside the loop. This is at least 3 times quadratic cost. This will not speed things up.

    3) There's no reason to parse and serialize the page content as you do. You want to add another layer, regardless of old content. Just add the old stream and the new stream.

    4) manipulating the old page will surely keep it in memory from now on, while being completely unnecessary.

    5) Creating the new page tree efficiently was already the topic of a 25 message long conversation some days ago - with you. Maybe you should look it up.

    Look up comments in your code.
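
The cost argument in points 1) and 2) can be made concrete with a toy model. The sketch below does not use jPod; `ToyPage`, `ToyPageList` and `TraversalCost` are hypothetical classes that only imitate a structure where indexed lookup restarts from the first page. It counts how many links are followed in each style.

```java
// Simplified model of a linked page structure; these are not the jPod classes.
class ToyPage {
    final int number;
    ToyPage next;

    ToyPage(int number) {
        this.number = number;
    }
}

class ToyPageList {
    final ToyPage first;
    long steps = 0; // counts how many links were followed

    // Builds a chain of pageCount pages (pageCount must be at least 1).
    ToyPageList(int pageCount) {
        ToyPage head = new ToyPage(0);
        ToyPage tail = head;
        for (int i = 1; i < pageCount; i++) {
            tail.next = new ToyPage(i);
            tail = tail.next;
        }
        first = head;
    }

    // Indexed lookup that always restarts from the first page.
    ToyPage pageAt(int index) {
        ToyPage page = first;
        for (int i = 0; i < index; i++) {
            page = page.next;
            steps++;
        }
        return page;
    }

    // Follows a single link from the page the caller already holds.
    ToyPage nextOf(ToyPage page) {
        steps++;
        return page.next;
    }
}

class TraversalCost {
    // Visits every page through the index, restarting from the head each time.
    static long indexedCost(int n) {
        ToyPageList list = new ToyPageList(n);
        for (int i = 0; i < n; i++) {
            list.pageAt(i);
        }
        return list.steps;
    }

    // Visits every page by following the "next" link of the current page.
    static long linkedCost(int n) {
        ToyPageList list = new ToyPageList(n);
        ToyPage page = list.first;
        while (page != null) {
            page = list.nextOf(page);
        }
        return list.steps;
    }

    public static void main(String[] args) {
        System.out.println(indexedCost(1000)); // 499500 link steps
        System.out.println(linkedCost(1000));  // 1000 link steps
    }
}
```

For 100,000 pages the indexed style would follow roughly five billion links in this model, and calling the lookup several times per loop iteration multiplies that again.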

     
  • Visitor
    Visitor
    2013-04-12

    Yes, I restarted from the balanced page tree code from the test of some days ago.
    Now I ended up at this point:
    https://friendpaste.com/69voaS1et3ImpN2P2ymdYK

    I have 2 questions.
    The bottleneck now is the deepCopy, which is mandatory; otherwise I get an error saying that I cannot copy objects from one locator to another.
    The other question is: how can I merge the PDForm with a page?

    Thanks
    Marco

     
  • mtraut
    mtraut
    2013-04-15

    "deepCopy" on a page is never a good idea, as it has back references. You copy the whole document this way.

    You can simply copy the CONTENT STREAM of the old page (check resources beforehand) and add the stream to the new page. The watermark will be another stream for the new page.

    Even better would be to manipulate the original document directly by simply adding (prepending) the watermark.
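
Why a deep copy of one page can drag in the whole document is easy to show with a toy object graph. The following sketch does not use jPod; `ToyNode` and `DeepCopyDemo` are hypothetical, and the copy routine simply follows every reference, including the back reference to the parent.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Toy node with a back reference to its parent; not a jPod class.
class ToyNode {
    ToyNode parent;
    final List<ToyNode> kids = new ArrayList<>();

    ToyNode addKid() {
        ToyNode kid = new ToyNode();
        kid.parent = this;
        kids.add(kid);
        return kid;
    }
}

class DeepCopyDemo {
    // Copies every node reachable from 'node', following parent links as well as kids.
    static ToyNode deepCopy(ToyNode node, Map<ToyNode, ToyNode> copies) {
        if (node == null) {
            return null;
        }
        ToyNode existing = copies.get(node);
        if (existing != null) {
            return existing;
        }
        ToyNode copy = new ToyNode();
        copies.put(node, copy);
        copy.parent = deepCopy(node.parent, copies);
        for (ToyNode kid : node.kids) {
            copy.kids.add(deepCopy(kid, copies));
        }
        return copy;
    }

    // Builds a root with pageCount leaf pages, deep-copies only the first leaf,
    // and returns how many nodes ended up being copied.
    static int nodesCopiedForOnePage(int pageCount) {
        ToyNode root = new ToyNode();
        ToyNode firstPage = null;
        for (int i = 0; i < pageCount; i++) {
            ToyNode page = root.addKid();
            if (firstPage == null) {
                firstPage = page;
            }
        }
        Map<ToyNode, ToyNode> copies = new IdentityHashMap<>();
        deepCopy(firstPage, copies);
        return copies.size();
    }

    public static void main(String[] args) {
        System.out.println(nodesCopiedForOnePage(100)); // 101: all pages plus the root
    }
}
```

Copying only the content stream avoids this, because the stream in this picture has no link back up to the page tree.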

     
  • Visitor
    Visitor
    2013-04-16

    Ok I've modified the code according to your suggestion
    https://friendpaste.com/69voaS1et3ImpN2P2ymdYK

    but now I got a:
    java.lang.ClassCastException: de.intarsys.pdf.cos.COSDictionary cannot be cast to de.intarsys.pdf.cos.COSStream
    at de.intarsys.pdf.cos.COSBasedObject.cosGetStream(COSBasedObject.java:281)
    at WaterMarkingJpod.run(WaterMarkingJpod.java:146)
    at WaterMarkingJpod.runTest(WaterMarkingJpod.java:66)
    at WaterMarkingJpod.main(WaterMarkingJpod.java:57)

    This is probably due to the fact that I don't know how to implement your suggestion :)
    Could you help me please?
    Thanks
    Marco

     
  • Visitor
    Visitor
    2013-04-16

    I've also tried to use origPage as argument for
    PDPage.META.createFromCos
    but I got the error:

    java.lang.IllegalStateException: You can not merge objects from different documents
    at de.intarsys.pdf.cos.COSIndirectObject.registerWith(COSIndirectObject.java:514)
    at de.intarsys.pdf.cos.COSCompositeObject.register(COSCompositeObject.java:250)
    at de.intarsys.pdf.cos.COSIndirectObject.associate(COSIndirectObject.java:189)
    at de.intarsys.pdf.cos.COSIndirectObject.addContainer(COSIndirectObject.java:178)
    at de.intarsys.pdf.cos.COSDictionary.basicPutPropagate(COSDictionary.java:242)
    at de.intarsys.pdf.cos.COSDictionary.put(COSDictionary.java:621)
    at de.intarsys.pdf.cos.COSBasedObject.cosSetField(COSBasedObject.java:328)
    at de.intarsys.pdf.cos.COSBasedObject.setFieldObject(COSBasedObject.java:682)
    at de.intarsys.pdf.pd.PDPageNode.setParent(PDPageNode.java:414)
    at de.intarsys.pdf.pd.PDPageTree.addNode(PDPageTree.java:107)
    at de.intarsys.pdf.pd.PDPageTree.addNodeAfter(PDPageTree.java:141)
    at de.intarsys.pdf.pd.PDPageTree.addNode(PDPageTree.java:119)
    at WaterMarkingJpod.run(WaterMarkingJpod.java:147)
    at WaterMarkingJpod.runTest(WaterMarkingJpod.java:66)
    at WaterMarkingJpod.main(WaterMarkingJpod.java:57)

     
  • Visitor
    Visitor
    2013-04-16

    I've also tried to copyDeep the COSDictionary, but the result is the same as with copyDeep of the document.
    https://friendpaste.com/4eJCjoFlAG8UP1p43DfbJw

     
  • Waldemar Dick
    Waldemar Dick
    2013-04-17

    Hi,

    It took me a while to read through the conversation.

    Ok I've modified the code according to your suggestion
    https://friendpaste.com/69voaS1et3ImpN2P2ymdYK
    but now I got a:
    java.lang.ClassCastException: de.intarsys.pdf.cos.COSDictionary cannot be cast to de.intarsys.pdf.cos.COSStream
       at de.intarsys.pdf.cos.COSBasedObject.cosGetStream(COSBasedObject.java:281)
       at WaterMarkingJpod.run(WaterMarkingJpod.java:146)   

    In line 146 the code states:

    COSStream cosGetStream = pageNode.cosGetStream();
    

    All methods starting with cosGet<type> that are implemented in COSBasedObject cast the object that the PDObject is based on to the specified type. A PDPage is based on a COSDictionary, not on a COSStream, hence the ClassCastException.
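
The casting behavior described above can be imitated in a few lines. This is a toy model with hypothetical class names (`ToyCosObject`, `ToyBasedObject` and so on), not the jPod implementation; it only shows why asking a dictionary-based object for a stream fails at runtime.

```java
// Toy model of typed accessors that cast an underlying object; not the jPod classes.
class ToyCosObject { }

class ToyCosDictionary extends ToyCosObject { }

class ToyCosStream extends ToyCosObject { }

class ToyBasedObject {
    private final ToyCosObject base;

    ToyBasedObject(ToyCosObject base) {
        this.base = base;
    }

    // Untyped access always works.
    ToyCosObject getObject() {
        return base;
    }

    // Typed access is just a cast and fails if the base has another type.
    ToyCosDictionary getDict() {
        return (ToyCosDictionary) base;
    }

    ToyCosStream getStream() {
        return (ToyCosStream) base;
    }
}

class CastDemo {
    // Returns true if asking for a stream throws a ClassCastException.
    static boolean streamAccessFails(ToyBasedObject object) {
        try {
            object.getStream();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ToyBasedObject page = new ToyBasedObject(new ToyCosDictionary());
        System.out.println(streamAccessFails(page)); // prints true
    }
}
```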

    I've also tried to use origPage as argument for PDPage.META.createFromCos but I got the error: java.lang.IllegalStateException: You can not merge objects from different documents

    at WaterMarkingJpod.run(WaterMarkingJpod.java:147)

    Line 147 doesn't match, but I assume it is this code:

    COSObject origPage = origPageNode.cosGetObject();
    PDPageNode pageNode = (PDPageNode) PDPageTree.META.createFromCos(origPage);
    

    You get the COSObject from the original document and try to use it in the new document, which is not allowed.
    You could try to copy the object from one document and then place it in the new document.

    But neither detail will solve your overall problem. As I understand it, you want to add a content stream (watermark) to every page in the document with the least amount of heap memory possible.

    My first tip: don't try to build a new document out of the original one. This is a difficult task, which can result in missing or duplicated resources. Copy the file first, then modify it. That way, resources that are not needed in the watermarking process won't be loaded into memory.
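
The "copy the file first" step needs nothing from jPod. A minimal sketch using only the JDK follows; the class name `WorkingCopy` and the file names in the usage note are hypothetical. The copy is done on disk, so the PDF is not loaded into the Java heap by this step.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class WorkingCopy {
    // Copies 'original' to 'target', overwriting any existing target,
    // and returns the path of the copy that should be modified afterwards.
    static Path create(Path original, Path target) throws IOException {
        return Files.copy(original, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A call such as `WorkingCopy.create(Paths.get("in.pdf"), Paths.get("out.pdf"))` would then be followed by opening `out.pdf` with the PDF library and saving it incrementally.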

    If you then add one COSStream to each page's /Contents entry, it should result in an incremental write of the page tree, every page, and the COSStream object.

    If you run out of memory, try to rebalance the page tree, as discussed with mtraut, and/or save every x thousand pages.

    It is difficult to optimize this process without knowing anything about the target document's structure.
    To see what changes with each save, just open the PDF file in a text editor and take a look at the trailer, which references all new and changed objects in the document.
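
For a 500 MB file a text editor may struggle, so the end of the file can also be read programmatically. The sketch below uses only the JDK; `PdfTail` is a hypothetical helper name. It relies on the standard PDF layout in which each save ends with a `startxref` keyword, a byte offset, and `%%EOF`.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class PdfTail {
    // Returns the last maxBytes bytes of the file, decoded as ISO-8859-1 text.
    static String tail(String path, int maxBytes) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            int count = (int) Math.min(length, maxBytes);
            byte[] buffer = new byte[count];
            file.seek(length - count);
            file.readFully(buffer);
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }

    // Extracts the number that follows the last "startxref" keyword, or -1 if absent.
    static long lastStartXref(String tail) {
        int index = tail.lastIndexOf("startxref");
        if (index < 0) {
            return -1;
        }
        String rest = tail.substring(index + "startxref".length()).trim();
        int end = 0;
        while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
            end++;
        }
        return end == 0 ? -1 : Long.parseLong(rest.substring(0, end));
    }
}
```

Printing `PdfTail.tail(path, 1024)` after each incremental save should show a new trailer appended at the end, and the offset returned by `lastStartXref` points at the cross-reference data written by that save.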

    Hope this helps.

     
  • Visitor
    Visitor
    2013-04-17

    Ah, I understood!
    I don't need to rewrite the stream, just iterate and change the objects inside.
    Correct?

     
  • Waldemar Dick
    Waldemar Dick
    2013-04-18

    A page's content is defined in the dictionary element /Contents. /Contents can contain a single COSStream or an array of COSStreams. Just add the COSStream defining the watermark to the page's /Contents entry and not to the already existing COSStream. See PDPage.cosAddContents().
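
The "single stream or array of streams" rule can be pictured with a toy model. The sketch below is not jPod code and does not show how PDPage.cosAddContents() is implemented; `ToyContentsPage` is a hypothetical class that only illustrates adding the watermark as its own entry while leaving the existing stream untouched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of a page dictionary's contents entry; not the jPod classes.
class ToyContentsPage {
    // Either a String (single stream), a List<String> (array of streams), or null.
    private Object contents;

    ToyContentsPage(String singleStream) {
        this.contents = singleStream;
    }

    // Adds 'stream' as its own entry instead of editing the existing stream.
    @SuppressWarnings("unchecked")
    void addContents(String stream) {
        if (contents == null) {
            contents = stream;
        } else if (contents instanceof List) {
            ((List<String>) contents).add(stream);
        } else {
            // A single stream becomes a two-element array.
            List<String> array = new ArrayList<>();
            array.add((String) contents);
            array.add(stream);
            contents = array;
        }
    }

    // Returns the streams in the order in which they would be processed.
    @SuppressWarnings("unchecked")
    List<String> streams() {
        if (contents == null) {
            return Collections.emptyList();
        }
        if (contents instanceof List) {
            return new ArrayList<>((List<String>) contents);
        }
        return Collections.singletonList((String) contents);
    }
}
```

In a real PDF the streams of such an array are processed in order, so whether the watermark is placed before or after the original stream decides whether it is drawn under or over the page content.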

    The only downside of this approach is that it is quite easy to remove this kind of watermark. But do your customers have that capability?

     
  • Visitor
    Visitor
    2013-04-24

    Everything works better than expected
    on one PDF.
    When I switch to another one, only one page in every 10000 gets watermarked.
    Anyway, I implemented a recursive method to walk through the COS tree, but I got the following error:

    de.intarsys.pdf.cos.COSSwapException: parse error reading object 2 0 R
        at de.intarsys.pdf.cos.COSIndirectObject.basicSwapIn(COSIndirectObject.java:209)
        at de.intarsys.pdf.cos.COSIndirectObject.swapIn(COSIndirectObject.java:665)
        at de.intarsys.pdf.cos.COSIndirectObject.dereference(COSIndirectObject.java:296)
        at de.intarsys.pdf.cos.COSDictionary.get(COSDictionary.java:547)
        at de.intarsys.pdf.st.STDocument.updateModificationDate(STDocument.java:1275)
        at de.intarsys.pdf.writer.COSWriter.basicWriteDocument(COSWriter.java:297)
        at de.intarsys.pdf.writer.COSWriter.writeDocument(COSWriter.java:805)
        at de.intarsys.pdf.st.STDocument.save(STDocument.java:1094)
        at de.intarsys.pdf.cos.COSDocument.save(COSDocument.java:656)
        at de.intarsys.pdf.pd.PDDocument.save(PDDocument.java:852)
        at de.intarsys.pdf.pd.PDDocument.save(PDDocument.java:824)
        at SelfWaterMarkingJpod.navigateCosTree(SelfWaterMarkingJpod.java:140)
        at SelfWaterMarkingJpod.run(SelfWaterMarkingJpod.java:112)
        at SelfWaterMarkingJpod.runTest(SelfWaterMarkingJpod.java:50)
        at SelfWaterMarkingJpod.main(SelfWaterMarkingJpod.java:41)
    Caused by: de.intarsys.pdf.parser.COSLoadError: invalid object number at character index 326840031
        at de.intarsys.pdf.parser.COSDocumentParser.parseIndirectObjectKey(COSDocumentParser.java:220)
        at de.intarsys.pdf.parser.COSDocumentParser.parseIndirectObject(COSDocumentParser.java:125)
        at de.intarsys.pdf.st.STXRefEntryOccupied.load(STXRefEntryOccupied.java:115)
        at de.intarsys.pdf.st.STXRefSection.load(STXRefSection.java:296)
        at de.intarsys.pdf.st.STTrailerXRefSection.load(STTrailerXRefSection.java:109)
        at de.intarsys.pdf.st.STDocument.load(STDocument.java:872)
        at de.intarsys.pdf.st.STDocument.load(STDocument.java:862)
        at de.intarsys.pdf.cos.COSIndirectObject.basicSwapIn(COSIndirectObject.java:205)
        ... 14 more
    

    this is the code

    https://friendpaste.com/15GXISzN0XdOcwrApscxZK

    How can I use jPod to scan all the pages?

    Thanks
    Marco

     
  • Visitor
    Visitor
    2013-04-24

    My PDF structure is a tree as shown in picture

     
  • Visitor
    Visitor
    2013-04-24

    I've also tried to just change a single page inside the tree
    but the result is a corrupted pdf.
    Using:

                PDForm form = createForm();
    
                i = 0;
    
                COSArray cosKidsL1 = pageTree.cosGetField(PDPageTree.DK_Kids).asArray();
                Iterator iterratorL1 = cosKidsL1.iterator();
                PDPageNode nodeL1 = (PDPageNode) PDPageNode.META.createFromCos((COSObject) iterratorL1.next());
    
                COSArray cosKidsL2 = nodeL1.cosGetField(PDPageTree.DK_Kids).asArray();
    
                Iterator iterratorL2 = cosKidsL2.iterator();
                PDPageNode nodeL2 = (PDPageNode) PDPageNode.META.createFromCos((COSObject) iterratorL2.next());
    
                PDPage page = nodeL2.getFirstPage();
    
                waterMarkThePage(page, form);
    
     
  • Waldemar Dick
    Waldemar Dick
    2013-04-24

    Hi,

    I ran your code on the PDF Specification 1.7 with 1310 pages and 32 MB size. Although the visual outcome was different than expected, the code ran without any problems.
    The good news is that your code didn't corrupt the document. So, my guess is: maybe your document was corrupt from the beginning?

    Why don't you take a look at the document's byte position 326840031, where the parser found an illegal object number?

    Caused by: de.intarsys.pdf.parser.COSLoadError: invalid object number at character index 326840031
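
Looking at a byte position that deep into a large file is easier with a few lines of code than with an editor. A minimal sketch with JDK classes only follows; `ByteWindow` is a hypothetical helper, and the radius in the usage note is an arbitrary choice.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class ByteWindow {
    // Reads up to 'radius' bytes before and after 'offset' and decodes them as ISO-8859-1.
    // Returns an empty string if the window lies completely outside the file.
    static String around(String path, long offset, int radius) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long start = Math.max(0, offset - radius);
            long end = Math.min(file.length(), offset + radius);
            if (end <= start) {
                return "";
            }
            byte[] buffer = new byte[(int) (end - start)];
            file.seek(start);
            file.readFully(buffer);
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Calling `ByteWindow.around(path, 326840031L, 200)` would print the text surrounding the reported position, which should show whether an indirect object header (an object number, a generation number and `obj`) really starts there.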

     
  • Visitor
    Visitor
    2013-04-25

    The document opens correctly before applying the code.
    The corruption occurs with the second code (the one that applies the watermark to a specific page).
    Did you try that as well?

     
  • Visitor
    Visitor
    2013-04-25

    Anyway, the error is that in
    de.intarsys.pdf.parser.COSDocumentParser.parseIndirectObjectKey(IRandomAccess)
    the value of the token at 326840031 is '<',
    and then a NumberFormatException is thrown.

     
  • Visitor
    Visitor
    2013-04-25

    I've updated the code
    https://friendpaste.com/3r6AGc3fAl600NuiWfrb6O
    and executed it against the PDF specification as you suggested.
    When going to page 11 I got the corruption error.

     
  • Visitor
    Visitor
    2013-04-25

    and also on pages 31, 51, 61, 71, 81 and so on.
    Seems like some nodes of the COS tree are not handled correctly by my code or by the jPod library.

     
  • Visitor
    Visitor
    2013-04-25

    I've solved the stack trace problem; it seems to depend on when I run the save command against the fromDocument.
    Now I'm trying to fix the corruption problem and the logic of the algorithm, which does not navigate the tree correctly.
    Do you have any suggestion why?
    Thanks
    Marco

     
  • mtraut
    mtraut
    2013-04-26

    I can't detect any corruption with my local tests.

    As for the navigation: You simply have to decide if you want to iterate

            PDPage page = node.getFirstPage();
            while (page != null) {
                // handle the current page before advancing, so the first page is not skipped
                doPage(page);
                page = page.getNextPage();
            }
    

    or recurse,

        private void visit(PDPageNode node) {
            if (node instanceof PDPage) {
                PDPage page = (PDPage) node;
                doPage(page);
            } else if (node instanceof PDPageTree) {
                PDPageTree tree = (PDPageTree) node;
                Iterator<PDPageNode> it = ((List<PDPageNode>) tree.getKids()).iterator();
                while (it.hasNext()) {
                    visit(it.next());
                }
            } else {
                // well this would be really strange
            }
        }
    

    as you like/need, but not both.

    In addition, maybe you should think about getting professional assistance.

     
  • Visitor
    Visitor
    2013-04-27

    Indeed, I'm really interested in getting professional assistance.
    How can I get it?

     