jPod intarsys PDF library / Discussion / Help: WaterMark Large Spool File

Visitor - 2013-04-02

Hi all,
I'm evaluating whether jPod can be used to add a watermark to each page of a big document:
100,000 pages
500 MB
I would like to avoid loading the whole stream into memory.
Can I do this with jPod?
Thanks.
Marco

Visitor - 2013-04-02

To be specific, I can only use 256 MB of memory for the whole run.

mtraut - 2013-04-03

In theory, yes.

You can walk through the page tree, change each page, and SAVE (incrementally). The COSIndirectObject should then be eligible for garbage collection, as it is held via a SoftReference only.

As this is not standard stuff in our usage scenarios, this MAY fail at first because of references that still hold the pages in memory, but it should not. As far as I remember, there have been issues in your usage regarding the font registry? Anyway, IF you have problems AND come up with issues that are tracked down to a garbage collection root for this scenario, I'm glad to fix this (better: try).
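
The point about objects held only via SoftReference can be illustrated in plain Java. This is a minimal sketch, not jPod's implementation: `SoftSlot`, its loader and `simulateCollection()` are hypothetical names made up for this example. It only shows the general pattern that a softly referenced value can disappear and then be loaded again on the next access.

```java
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

/** Hypothetical stand-in for an object that can be loaded again on demand. */
final class SoftSlot<T> {
    private final Supplier<T> loader;   // re-creates the value, e.g. by re-reading a file
    private SoftReference<T> ref = new SoftReference<>(null);
    private int loads = 0;

    SoftSlot(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, reloading it if the soft reference was cleared. */
    T get() {
        T value = ref.get();
        if (value == null) {
            value = loader.get();
            ref = new SoftReference<>(value);
            loads++;
        }
        return value;
    }

    /** Simulates the garbage collector clearing the reference under memory pressure. */
    void simulateCollection() {
        ref.clear();
    }

    /** Number of times the loader had to run. */
    int loadCount() {
        return loads;
    }
}
```

As long as some other object keeps a strong reference to the value, the collector will not clear it, which is the kind of "garbage collection root" that would have to be tracked down if memory keeps growing.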

Visitor - 2013-04-11

Hi,
I tried the following:

https://friendpaste.com/3YzlS3bvTcXEPK3FZkl7nb

with a 400 MB PDF, but

CDSRectangle rect = pageTree.getPageAt(i).getCropBox();

takes ages to execute, even at the very first step, and the process already uses 800 MB of RAM.

What did I miss?
Thanks for your support.
Marco

Visitor - 2013-04-12

I rewrote it, but the problem persists.
I can't find a way to iterate over the pages without reading all of them.
Does one exist?

mtraut - 2013-04-12

1) Looking at the jPod implementation should reveal very easily that PDPageTree.getPageAt() is quite inefficient to use in this case, as it is a "linked tree" structure, not direct access. It will start from page one every time. To navigate through pages, use "next" and "previous".

2) Then, instead of reusing the page you already have, you start looking it up all over again in the loop. That is at least three times a quadratic cost. This will not speed things up.

3) There's no reason to parse and serialize the page content as you do. You want to add another layer, regardless of the old content. Just add the old stream and the new stream.

4) Manipulating the old page will surely keep it in memory from then on, while being completely unnecessary.

5) Creating the new page tree efficiently was already the topic of a 25 message long conversation some days ago, with you. Maybe you should look it up.

Look up comments in your code.
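
The cost difference in point 1 can be made concrete with a toy model. This is a sketch under the assumption stated above (an index lookup restarts at the first page every time). `LinkedPages` and its methods are hypothetical and do not use the jPod classes.

```java
/** Toy model of pages that can only be reached by following "next" links. */
final class LinkedPages {
    static final class Page {
        Page next;
    }

    private final Page first;
    long steps = 0;   // number of "next" links followed so far

    /** Builds a chain of count pages (count must be at least 1). */
    LinkedPages(int count) {
        first = new Page();
        Page tail = first;
        for (int i = 1; i < count; i++) {
            tail.next = new Page();
            tail = tail.next;
        }
    }

    /** Index lookup in the style of getPageAt(i): always restarts at the first page. */
    Page pageAt(int index) {
        Page p = first;
        for (int i = 0; i < index; i++) {
            p = p.next;
            steps++;
        }
        return p;
    }

    /** Visits every page through index lookups and returns the links followed. */
    static long walkByIndex(int count) {
        LinkedPages pages = new LinkedPages(count);
        for (int i = 0; i < count; i++) {
            pages.pageAt(i);
        }
        return pages.steps;
    }

    /** Visits every page by following "next" and returns the links followed. */
    static long walkByNext(int count) {
        LinkedPages pages = new LinkedPages(count);
        for (Page p = pages.first; p.next != null; p = p.next) {
            pages.steps++;
        }
        return pages.steps;
    }
}
```

For 1,000 pages the index walk follows 499,500 links and the "next" walk follows 999. For the 100,000 pages mentioned at the start of the thread, the same formula n(n-1)/2 gives roughly 5 billion link traversals for a single index-based pass.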

Visitor - 2013-04-12

Yes, I restarted from the balanced page tree code written for the test some days ago.
Now I have ended up at this point:
https://friendpaste.com/69voaS1et3ImpN2P2ymdYK

I have two questions.
The bottleneck now is the deepCopy, which is mandatory because otherwise I get an error saying that I cannot copy objects from one locator to another.
The other question is: how can I merge the PDForm with a page?

Thanks
Marco

mtraut - 2013-04-15

"deepCopy" on a page is never a good idea, as it has back references. You copy the whole document this way.

You can simply copy the CONTENT STREAM of the old page (check resources beforehand) and add the stream to the new page. The watermark will be another stream for the new page.

Even better would be to manipulate the original document directly by simply adding (prepending) the watermark.

Visitor - 2013-04-16

OK, I've modified the code according to your suggestion:
https://friendpaste.com/69voaS1et3ImpN2P2ymdYK

but now I get a:
java.lang.ClassCastException: de.intarsys.pdf.cos.COSDictionary cannot be cast to de.intarsys.pdf.cos.COSStream
at de.intarsys.pdf.cos.COSBasedObject.cosGetStream(COSBasedObject.java:281)
at WaterMarkingJpod.run(WaterMarkingJpod.java:146)
at WaterMarkingJpod.runTest(WaterMarkingJpod.java:66)
at WaterMarkingJpod.main(WaterMarkingJpod.java:57)

This is probably due to the fact that I don't know how to implement your suggestion :)
Could you help me please?
Thanks
Marco

Visitor - 2013-04-16

I've also tried to use origPage as argument for
PDPage.META.createFromCos
but I got the error:

java.lang.IllegalStateException: You can not merge objects from different documents
at de.intarsys.pdf.cos.COSIndirectObject.registerWith(COSIndirectObject.java:514)
at de.intarsys.pdf.cos.COSCompositeObject.register(COSCompositeObject.java:250)
at de.intarsys.pdf.cos.COSIndirectObject.associate(COSIndirectObject.java:189)
at de.intarsys.pdf.cos.COSIndirectObject.addContainer(COSIndirectObject.java:178)
at de.intarsys.pdf.cos.COSDictionary.basicPutPropagate(COSDictionary.java:242)
at de.intarsys.pdf.cos.COSDictionary.put(COSDictionary.java:621)
at de.intarsys.pdf.cos.COSBasedObject.cosSetField(COSBasedObject.java:328)
at de.intarsys.pdf.cos.COSBasedObject.setFieldObject(COSBasedObject.java:682)
at de.intarsys.pdf.pd.PDPageNode.setParent(PDPageNode.java:414)
at de.intarsys.pdf.pd.PDPageTree.addNode(PDPageTree.java:107)
at de.intarsys.pdf.pd.PDPageTree.addNodeAfter(PDPageTree.java:141)
at de.intarsys.pdf.pd.PDPageTree.addNode(PDPageTree.java:119)
at WaterMarkingJpod.run(WaterMarkingJpod.java:147)
at WaterMarkingJpod.runTest(WaterMarkingJpod.java:66)
at WaterMarkingJpod.main(WaterMarkingJpod.java:57)

Visitor - 2013-04-16

I've also tried to copyDeep the COSDict, but the result is the same as with copyDeep of the document.
https://friendpaste.com/4eJCjoFlAG8UP1p43DfbJw

Waldemar Dick - 2013-04-17

Hi,

It took me a while to read through the conversation.

Ok I've modified the code according to your suggestion
https://friendpaste.com/69voaS1et3ImpN2P2ymdYK
but now I got a:
java.lang.ClassCastException: de.intarsys.pdf.cos.COSDictionary cannot be cast to de.intarsys.pdf.cos.COSStream
at de.intarsys.pdf.cos.COSBasedObject.cosGetStream(COSBasedObject.java:281)
at WaterMarkingJpod.run(WaterMarkingJpod.java:146)

In line 146 the code states:

COSStream cosGetStream = pageNode.cosGetStream();

All methods that start with cosGet<type> and are implemented in COSBasedObject cast the object that the PDObject is based on to the specified type. A PDPage is based on a COSDictionary, not on a COSStream, hence the ClassCastException.
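
The failure mode can be reproduced in miniature without jPod. The classes below are hypothetical stand-ins (they are not the jPod types); the sketch only contrasts a blind cast, as a cosGet<type> style accessor would perform it, with a checked variant.

```java
/** Hypothetical miniature of a typed accessor over a low-level object. */
final class CastSketch {
    // Stand-ins for low-level object kinds; these are not the jPod classes.
    static class LowLevelObject {}
    static class Dictionary extends LowLevelObject {}
    static class Stream extends LowLevelObject {}

    /** Blind cast: throws ClassCastException if base is not a Stream. */
    static Stream getStream(LowLevelObject base) {
        return (Stream) base;
    }

    /** Defensive variant: returns null when the base object is not a stream. */
    static Stream getStreamOrNull(LowLevelObject base) {
        return (base instanceof Stream) ? (Stream) base : null;
    }

    /** Reports whether the blind cast fails for the given base object. */
    static boolean blindCastFails(LowLevelObject base) {
        try {
            getStream(base);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```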

I've also tried to use origPage as argument for PDPage.META.createFromCos but I got the error: java.lang.IllegalStateException: You can not merge objects from different documents
…
at WaterMarkingJpod.run(WaterMarkingJpod.java:147)

Line 147 doesn't match, but I assume it is this code:

COSObject origPage = origPageNode.cosGetObject();
PDPageNode pageNode = (PDPageNode) PDPageTree.META.createFromCos(origPage);

You get the COSObject from the original document and try to use it in the new document, which is not allowed.
You could try to copy the object from one document and then place it in the new document.

But both details won't solve your overall problem. As I understand it, you want to add a content stream (watermark) to every page in the document with the least amount of heap memory possible.

My first tip: don't try to build a new document out of the original one. This is a difficult task, which can result in missing or duplicated resources. Copy the file first, then modify it. That way, resources that are not needed in the watermarking process won't be loaded into memory.

If you then add one COSStream to each page's /Contents entry, it should amount to an incremental write of the page tree, every page, and the COSStream object.

If you run out of memory, try to rebalance the page tree, as discussed with mtraut, and/or save every x thousand pages.
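
The "save every x thousand pages" idea can be sketched as a small batching loop. This is a minimal illustration, not jPod code: `BatchedRun`, `perPage` and `save` are hypothetical names, and in a real run the two callbacks would modify a page and trigger an incremental save.

```java
import java.util.function.IntConsumer;

/** Sketch of "modify N pages, then save" batching; the callbacks are hypothetical. */
final class BatchedRun {
    /**
     * Calls perPage for every page index and save after each full batch,
     * plus once at the end if pages remain unsaved. Returns the number of saves.
     */
    static int run(int pageCount, int batchSize, IntConsumer perPage, Runnable save) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int saves = 0;
        int unsaved = 0;
        for (int i = 0; i < pageCount; i++) {
            perPage.accept(i);
            unsaved++;
            if (unsaved == batchSize) {
                save.run();
                saves++;
                unsaved = 0;
            }
        }
        if (unsaved > 0) {
            save.run();   // flush the last, partially filled batch
            saves++;
        }
        return saves;
    }
}
```

A smaller batch size bounds how many changed pages are pending at once, at the price of more save operations; the right value depends on the document and the available heap.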

It is difficult to optimize this process without knowing anything about the target document's structure.
To see what changes with each save, just open the PDF file in a text editor and take a look at the trailer, which references all new and changed objects in the document.

Hope this helps.

Visitor - 2013-04-17

Ah, I understand!
I don't need to rewrite the stream, just iterate over the objects and change them in place.
Correct?

Waldemar Dick - 2013-04-18

A page's content is defined in the dictionary entry /Contents. /Contents can contain a single COSStream or an array of COSStreams. Just add the COSStream defining the watermark to the page's /Contents entry, not to the already existing COSStream. See PDPage.cosAddContents().

The only downside of this approach is that it is quite easy to remove this kind of watermark. But do your customers have the ability to do that?
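
The two shapes of /Contents (single stream or array of streams) and the effect of adding one more stream can be modelled in a few lines. This is a toy sketch: `ContentsSketch` is hypothetical, a stream is represented by a String, and it does not show what PDPage.cosAddContents() does internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a contents entry that is a single stream or an array of streams. */
final class ContentsSketch {
    /**
     * Returns the new value of the contents entry after adding one more stream.
     * A stream is modelled as a String, an array as a List of Strings.
     */
    static List<String> addStream(Object contents, String extraStream) {
        List<String> result = new ArrayList<>();
        if (contents instanceof List) {
            for (Object stream : (List<?>) contents) {
                result.add((String) stream);   // existing streams stay untouched
            }
        } else if (contents != null) {
            result.add((String) contents);     // single stream becomes the first element
        }
        result.add(extraStream);               // the new stream comes last
        return result;
    }
}
```

Because the streams in a /Contents array are processed in order, a stream appended at the end paints over the existing content, while a prepended one (as suggested earlier in the thread) is painted first and ends up underneath.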

Everything works better than expected on one PDF.
When I switch to another one, only one page in every 10,000 gets watermarked.
Anyway, I implemented a recursive method to walk through the COS tree, but I got the following error:

de.intarsys.pdf.cos.COSSwapException: parse error reading object 2 0 R
    at de.intarsys.pdf.cos.COSIndirectObject.basicSwapIn(COSIndirectObject.java:209)
    at de.intarsys.pdf.cos.COSIndirectObject.swapIn(COSIndirectObject.java:665)
    at de.intarsys.pdf.cos.COSIndirectObject.dereference(COSIndirectObject.java:296)
    at de.intarsys.pdf.cos.COSDictionary.get(COSDictionary.java:547)
    at de.intarsys.pdf.st.STDocument.updateModificationDate(STDocument.java:1275)
    at de.intarsys.pdf.writer.COSWriter.basicWriteDocument(COSWriter.java:297)
    at de.intarsys.pdf.writer.COSWriter.writeDocument(COSWriter.java:805)
    at de.intarsys.pdf.st.STDocument.save(STDocument.java:1094)
    at de.intarsys.pdf.cos.COSDocument.save(COSDocument.java:656)
    at de.intarsys.pdf.pd.PDDocument.save(PDDocument.java:852)
    at de.intarsys.pdf.pd.PDDocument.save(PDDocument.java:824)
    at SelfWaterMarkingJpod.navigateCosTree(SelfWaterMarkingJpod.java:140)
    at SelfWaterMarkingJpod.run(SelfWaterMarkingJpod.java:112)
    at SelfWaterMarkingJpod.runTest(SelfWaterMarkingJpod.java:50)
    at SelfWaterMarkingJpod.main(SelfWaterMarkingJpod.java:41)
Caused by: de.intarsys.pdf.parser.COSLoadError: invalid object number at character index 326840031
    at de.intarsys.pdf.parser.COSDocumentParser.parseIndirectObjectKey(COSDocumentParser.java:220)
    at de.intarsys.pdf.parser.COSDocumentParser.parseIndirectObject(COSDocumentParser.java:125)
    at de.intarsys.pdf.st.STXRefEntryOccupied.load(STXRefEntryOccupied.java:115)
    at de.intarsys.pdf.st.STXRefSection.load(STXRefSection.java:296)
    at de.intarsys.pdf.st.STTrailerXRefSection.load(STTrailerXRefSection.java:109)
    at de.intarsys.pdf.st.STDocument.load(STDocument.java:872)
    at de.intarsys.pdf.st.STDocument.load(STDocument.java:862)
    at de.intarsys.pdf.cos.COSIndirectObject.basicSwapIn(COSIndirectObject.java:205)
    ... 14 more

This is the code:

https://friendpaste.com/15GXISzN0XdOcwrApscxZK

How can I use jPod to scan all the pages?

Thanks
Marco

Visitor - 2013-04-24

My PDF structure is a tree as shown in picture

I've also tried to change just a single page inside the tree,
but the result is a corrupted PDF.
Using:

            PDForm form = createForm();

            i = 0;

            COSArray cosKidsL1 = pageTree.cosGetField(PDPageTree.DK_Kids).asArray();
            Iterator iterratorL1 = cosKidsL1.iterator();
            PDPageNode nodeL1 = (PDPageNode) PDPageNode.META.createFromCos((COSObject) iterratorL1.next());

            COSArray cosKidsL2 = nodeL1.cosGetField(PDPageTree.DK_Kids).asArray();

            Iterator iterratorL2 = cosKidsL2.iterator();
            PDPageNode nodeL2 = (PDPageNode) PDPageNode.META.createFromCos((COSObject) iterratorL2.next());

            PDPage page = nodeL2.getFirstPage();

            waterMarkThePage(page, form);

Waldemar Dick - 2013-04-24

Hi,

I ran your code on the PDF Specification 1.7 with 1310 pages and 32 MB size. Although the visual outcome was different than expected, the code ran without any problems.
The good news is, that your code didn't corrupt the document. So, my guess is: maybe your document was corrupt from the beginning?

Why don't you take a look at the document's byte position 326840031, where the parser found an illegal object number?

Caused by: de.intarsys.pdf.parser.COSLoadError: invalid object number at character index 326840031
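
Inspecting a byte position in a file of several hundred megabytes does not require loading it. The following is a minimal sketch using only the Java standard library; `PeekAtOffset` and the file name in the usage note are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Reads a few bytes around a file offset without loading the whole file. */
final class PeekAtOffset {
    static String peek(String path, long offset, int before, int after) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long start = Math.max(0, offset - before);
            long end = Math.min(file.length(), offset + after);
            if (end <= start) {
                return "";   // offset lies outside the file
            }
            byte[] buffer = new byte[(int) (end - start)];
            file.seek(start);
            file.readFully(buffer);
            // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
            return new String(buffer, StandardCharsets.ISO_8859_1);
        }
    }
}
```

For example, `PeekAtOffset.peek("input.pdf", 326840031L, 40, 40)` would return the 80 bytes surrounding the reported position. An indirect object in a PDF file starts with its object number, generation number and the keyword `obj`, so anything else at that offset suggests a cross-reference entry pointing at the wrong place.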

Visitor - 2013-04-25

The document opens correctly before applying the code.
The corruption occurs with the second piece of code (the one that applies the watermark to a specific page).
Did you try that as well?

Visitor - 2013-04-25

Anyway, the error is that in
de.intarsys.pdf.parser.COSDocumentParser.parseIndirectObjectKey(IRandomAccess)
the value of the token at 326840031 is '<',
and then a NumberFormatException is thrown.

Visitor - 2013-04-25

I've updated the code:
https://friendpaste.com/3r6AGc3fAl600NuiWfrb6O
and ran it against the PDF specification, as you suggested.
When going to page 11, I got the corruption error.

Visitor - 2013-04-25

And also on pages 31, 51, 61, 71, 81 and so on.
It seems that some nodes of the COS tree are not handled correctly by my code or by the jPod library.

Visitor - 2013-04-25

I've solved the stack trace problem; it seems to depend on when I run the save command against the fromDocument.
Now I'm trying to fix the corruption problem and the logic of the algorithm, which does not navigate the tree correctly.
Do you have any suggestion as to why?
Thanks
Marco

I can't detect any corruption with my local tests.

As for the navigation: You simply have to decide if you want to iterate

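        PDPage page = node.getFirstPage();
        while (page != null) {
            // handle the current page before advancing, so the first page
            // is not skipped and doPage() is never called with null
            doPage(page);
            page = page.getNextPage();
        }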

or recurse,

    private void visit(PDPageNode node) {
        if (node instanceof PDPage) {
            PDPage page = (PDPage) node;
            doPage(page);
        } else if (node instanceof PDPageTree) {
            PDPageTree tree = (PDPageTree) node;
            Iterator<PDPageNode> it = ((List<PDPageNode>) tree.getKids()).iterator();
            while (it.hasNext()) {
                visit(it.next());
            }
        } else {
            // well this would be really strange
        }
    }

as you like/need, but not both.

In addition, maybe you should think about getting professional assistance.

Visitor - 2013-04-27

Indeed, I'm really interested in getting professional assistance.
How can I get it?
