
Write Large PDF

Forum: Help
Created by: Visitor
Created: 2013-02-15
Last updated: 2013-05-28
  • Visitor
    Visitor
    2013-02-15

    Hi,
    I'm trying to write a large PDF file without using much memory,
    but the test.pdf file is always 0 bytes until the end of the process.
    It seems that it's using RAM before closing the stream and writing to disk.
    Any suggestion?

    This is my sample code:

    https://gist.github.com/visik7/4961145

     
  • mtraut
    mtraut
    2013-02-15

    1) Setting the write mode hint is a one-shot call. It is reset after each write, so this belongs in the loop.
    2) Using different locator instances on the PDDocument level when writing FORCES a full write - this emulates the Adobe Reader "Save As" (still, this is overridden by the write mode hint if you do this in the loop).
    3) As the location does not really change, a simple "save()" is fine. This will default to an incremental write.
    4) The STDocument internally uses a 4k buffer. If you stay within this range, nothing gets written until flush.
    5) If I debug your code, I see the size of "test.pdf" slowly build up in my explorer.

    So all in all, you are not using incremental mode. The serialized PDF is definitely not (completely) in memory. But keep in mind, your memory consumption does not depend only on the PDF serialization. The PD abstraction handles many (and large) objects. jPod is designed to swap out these objects when they are no longer used, but this also depends on your usage (references that create paths to GC roots).
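    To make points 1) - 3) concrete, the write loop could look roughly like this (only a sketch: the exact save(locator, options) and addPageNode signatures are assumptions taken from this thread, so check them against your jPod version; uses de.intarsys.pdf.pd.PDDocument / PDPage and de.intarsys.tools.locator.FileLocator):

    PDDocument document = PDDocument.createNew();
    // write once to a fixed location; afterwards keep using the same locator
    document.save(new FileLocator("test.pdf"), null);
    for (int i = 0; i < 1000; i++) {
        PDPage page = (PDPage) PDPage.META.createNew();
        document.addPageNode(page);
        // ... create the page content ...
        if ((i + 1) % 100 == 0) {
            // same location -> defaults to an incremental write, flushing to disk
            document.save();
        }
    }
    document.save();
    document.close();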

     
  • Visitor
    Visitor
    2013-02-19

    Hi,
    I've updated the code according to your suggestions, but generation is still quite slow:
    2 seconds on my Core i7 with an SSD for 1000 pages.
    We currently produce 25,000 pages per second with some proprietary technology that we want to retire;
    the sample code runs at 350 pages/sec.
    What could we optimize to solve this?

     
  • mtraut
    mtraut
    2013-02-19

    Well, I apologize, but actually there's no visionary around here :-)

    But, to be earnest: I don't know what your old library is doing and whether it can be compared at all. We do a lot of stuff that is of no importance in batch creation, because we are interaction focused. This is surely dead freight when generating docs in the background.

    If you're really interested in speeding up, you should profile… If you come up with specific questions, I can tell you why we did it this way or whether we have simply made some mistakes.

    In addition, I can provide you with the insight of another user who gave us feedback (it is not yet included in the distribution, so you may take advantage of it):

    • Gained a significant performance boost when reusing the same NumberFormat object in COSWriter.java. This is due to the localization service being expensive to create for every number written out.

    • We also overrode PDFPage.toByteArray() and in that method we created a new RandomAccessByteArray object that defaulted the byte array to 100,000 bytes. We thought there may be a way to default this array size based on the number of operations. This is so we don’t keep creating new byte arrays. We observed over four million byte arrays being created.

    If these changes sound like something you’d be interested in reviewing I can start working on making a patch.

    We’ve also found that we gained a lot of performance improvement related to garbage-collection pauses when we set the following JVM arg:

    -XX:SoftRefLRUPolicyMSPerMB=25

    We thought maybe you or other jPod users may be interested in this optimization.

    If you have similar (or more) findings, please come back and let us know.
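    To make the NumberFormat point concrete: the idea is one shared, pre-configured formatter instead of asking the localization service for a new instance per number written. Purely as an illustration of the idea (not the actual COSWriter code; note that NumberFormat is not thread-safe, so keep one per writer instance if threads are involved):

    // java.text.NumberFormat / java.util.Locale
    private static final NumberFormat PDF_NUMBER_FORMAT = createPdfNumberFormat();

    private static NumberFormat createPdfNumberFormat() {
        NumberFormat format = NumberFormat.getInstance(Locale.US);
        format.setGroupingUsed(false);      // no grouping separators in PDF numbers
        format.setMaximumFractionDigits(5);
        return format;
    }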

     
  • Visitor
    Visitor
    2013-02-19

    Another issue is that for 100,000 pages it took over 10 GB. I think there may be some problems with resource reuse, but I don't know the jPod internals; maybe the library is not designed for this kind of task?
    Thanks for the support.

     
  • mtraut
    mtraut
    2013-02-19

    I never created a document with 100,000 pages (and maybe one should revisit the requirements?), so I did not track down possible problems (btw. 10 GB of main memory, or what??).

    In theory jPod should be able to handle large documents, as it is random-access based and can swap out unused indirect objects.

    That said, as this is not in our focus, we do not test it and there may be some defect here. As designed, after an (incremental) write each object should have cleared its "dirty" state and as such be subject to garbage collection (if only indirectly referenced). Keeping a reference to the objects you built up will wreck this mechanism. The only object structure that is always in memory is the xref. But again, I do not exclude a memory leak, so you have to start your profiler….

    Another hint: when creating large page nodes, you should think about "rebalancing" the tree into multiple hierarchies.

     
  • Visitor
    Visitor
    2013-02-19

    No, no: 10 GB is the size of the document on the file system.

     
  • Visitor
    Visitor
    2013-02-20

    About the patch: yes, I'm definitely interested in it. If you can arrange it, I can do some tests.
    Thanks,

     
  • mtraut
    mtraut
    2013-02-20

    If you write 100,000 pages in incremental mode without page balancing, you can simply sum it up:

    - You (re)write the page tree node 100,000 times: first with 1 entry, then 2, …
    - Math 101: n * (n - 1) / 2 * (size of an indirect reference) - this easily sums up to multiple gigabytes.

    -> Split your page tree; the document will otherwise not be usable anyway.
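    (As a rough illustration: assuming ~10 bytes per indirect reference entry and a save after every page, 100,000 pages means about 100,000 * 99,999 / 2 ≈ 5 * 10^9 entries written, i.e. tens of gigabytes just for the page tree node; saving only every 100 pages already cuts that total by roughly a factor of 100.)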

     
  • mtraut
    mtraut
    2013-02-20

    As already said: I can provide you with the insight, not a patch. There's definitely no time frame here for this. The changes should be obvious, anyway.

    (Remember, the biggest savings should come from the page tree, anyway. Just look into the serialized PDF and try to verify your page tree array.)

     
  • Visitor
    Visitor
    2013-02-20

    Ah ok, I understood from this:
    "If these changes sound like something you’d be interested in reviewing I can start working on making a patch."
    that you would create a patch. Anyway, yes, the changes are obvious.

    For the balanced tree I have understood what the problem is,
    but I also have to admit that I have some difficulty understanding what to do.
    Something like:
    (pseudocode)
    document.addPageNode(page);
    page.addPageNode(page2)
    etc.?

    Is it possible to engage you or your company to support this task?
    Thanks
    Marco

     
  • mtraut
    mtraut
    2013-02-20

    I'm sorry, the text was only a citation from the other user who already worked on the topic.

    The simplest way to avoid the problem would be:

    - for every chunk of 10 or 100 pages
    - - create a page tree node

    PDPageTree newNode = (PDPageTree) PDPageTree.META.createNew();
    

    - - add to document
    - - add each new page to the intermediate node
    - - incremental save (do not save for every page)

    If the page tree node still gets too wide, make the tree deeper still. Another simple way is to create a "degenerate" tree like the one created by PDPageTree#rebalance.
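    Put together, the chunking above might look roughly like this (a sketch only: addPageNode/addNode are assumed method names based on this thread, so check them against the actual PD API; uses de.intarsys.pdf.pd.* and de.intarsys.tools.locator.FileLocator):

    PDDocument document = PDDocument.createNew();
    document.save(new FileLocator("test.pdf"), null); // first full write to a fixed location
    PDPageTree chunk = null;
    for (int i = 0; i < pageCount; i++) {
        if (i % 100 == 0) {
            // start a new intermediate page tree node every 100 pages
            chunk = (PDPageTree) PDPageTree.META.createNew();
            document.addPageNode(chunk);
        }
        PDPage page = (PDPage) PDPage.META.createNew();
        chunk.addNode(page);
        // ... create the page content ...
        if ((i + 1) % 1000 == 0) {
            document.save(); // incremental save, not once per page
        }
    }
    document.save();
    document.close();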

    Yes, our company can do consulting on PDF workflows. I think this depends on the context….

     
  • Visitor
    Visitor
    2013-02-20

    Code update:
    https://gist.github.com/visik7/4961145

    Here are some results:
    for 50,000 pages: 8.4 seconds
    for 100,000 pages: 24 seconds
    for 150,000 pages: 45 seconds

    There is a slow degradation.
    What do you think about it?

     
  • Visitor
    Visitor
    2013-02-21

    Other test results that I really don't understand.
    The output of the code (updated: https://gist.github.com/visik7/4961145 )
    is:
    3996.8025579536375 pag/sec
    8826.1253309797 pag/sec
    12674.27122940431 pag/sec
    12820.51282051282 pag/sec
    11185.682326621923 pag/sec
    13192.612137203167 pag/sec

    Which doesn't really make any sense.

    By the way, the last 3 runs are about half of the desired result, so I'm not too far from what I need,
    but
    if I bring the iterations from 10,000 to 50,000, performance drops to this:

    4692.632566870014 pag/sec
    5582.226191805292 pag/sec
    5479.45205479452 pag/sec
    5481.855059752221 pag/sec
    5561.116672227783 pag/sec
    5504.183179216204 pag/sec

    Which really doesn't make any sense.

    So here I have 2 issues:
    1st: I need 2 runs to get the maximum performance.
    2nd: Increasing the number of pages affects performance.

    Any clue?
    thanks
    Marco

     
  • mtraut
    mtraut
    2013-02-21

    The code and the resulting PDF look fine from a "standard" point of view. The increments are more or less the same size and few objects are rendered in a redundant way.

    If you have already applied the tips from our other user, you NEED TO PROFILE… Penalties may arise from garbage collection (because of our SoftReference caching), you may find that collections are resized in an inefficient way for this bulk, or there may be other deficiencies that show up only with large docs.

    Then come back.

     
  • Visitor
    Visitor
    2013-02-21

    I put a debugger inside NumberFormat and it's not used at all.
    A PDFPage class doesn't exist, so I didn't find any toByteArray.
    COSWriter has a toByteArray, but it already uses RandomAccessByteArray.
    I've also added -XX:SoftRefLRUPolicyMSPerMB=25 (and tested with other values): no effect.

    Profiling suggests that the majority of the time is spent in
    CosObjectWalkerShallow.

    https://dl.dropbox.com/u/175137/jpod-profile/profile.html

    But I have no clue on how to fix it.

     
  • mtraut
    mtraut
    2013-02-21

    OK - by PDFPage I think our fellow user meant CSContent. The default is to create a small (20 byte) byte array that resizes very slowly. You will encounter the problem when writing more page content. The same holds for COSWriter. You will not see any damage when only writing small objects…

    But the biggest "issue" is our internal garbage collection. I never thought about this. It is a typical artifact of interactive PDF handling, where, between two saves, some objects may already be out of scope. Therefore we do a cleanup to avoid dangling objects.

    In your scenario (as the old pages still exist in memory, even if swapped out) this leads to an O(n*n) penalty. And, best of all, you don't need it. Just go and delete the call to "incrementalGarbageCollect" in COSWriter (not garbageCollect).

     
  • Visitor
    Visitor
    2013-02-21

    I've seen that in COSWriter and CSContent a RandomAccessByteArray is already used.
    As for incrementalGarbageCollect, I've removed it, and the profile has changed to this:

    https://dl.dropbox.com/u/175137/jpod-profile/profile2.html

    There is a strange recursion

    the time result is this:

    200000 pages Total time: 0:02:01.200
    4125.412 pg/sec

    The really strange thing is that there is a performance degradation, which you can see from this output:

    0:00:01.499 3335.5570380253503 p/s
    0:00:00.710 7042.253521126761 p/s
    0:00:00.668 7485.029940119761 p/s
    0:00:00.469 10660.980810234541 p/s
    0:00:00.458 10917.03056768559 p/s
    0:00:00.413 12106.537530266343 p/s     BEST
    0:00:00.429 11655.011655011656 p/s
    0:00:00.489 10224.948875255624 p/s
    0:00:00.448 11160.714285714286 p/s
    0:00:00.451 11086.474501108647 p/s
    0:00:00.475 10526.315789473685 p/s
    0:00:00.485 10309.278350515464 p/s
    0:00:00.501 9980.039920159681 p/s
    0:00:00.512 9765.625 p/s
    0:00:00.535 9345.794392523365 p/s
    0:00:00.557 8976.660682226213 p/s
    0:00:00.679 7363.7702503681885 p/s
    0:00:00.666 7507.507507507507 p/s
    0:00:00.600 8333.333333333334 p/s
    0:00:00.604 8278.145695364237 p/s
    0:00:00.613 8156.606851549755 p/s
    0:00:00.634 7886.435331230284 p/s
    0:00:00.656 7621.951219512195 p/s
    0:00:00.651 7680.491551459293 p/s
    0:00:00.681 7342.1439060205585 p/s
    0:00:00.680 7352.941176470588 p/s
    0:00:00.689 7256.894049346879 p/s
    0:00:00.735 6802.7210884353735 p/s
    0:00:00.759 6587.615283267457 p/s
    0:00:00.865 5780.346820809249 p/s
    0:00:00.779 6418.485237483954 p/s
    0:00:00.797 6273.525721455459 p/s
    0:00:00.943 5302.226935312831 p/s
    0:00:00.833 6002.400960384153 p/s
    0:00:00.852 5868.544600938967 p/s
    0:00:00.849 5889.281507656066 p/s
    0:00:00.864 5787.037037037037 p/s
    0:00:00.898 5567.928730512249 p/s
    0:00:00.892 5605.3811659192825 p/s
    0:00:00.918 5446.623093681917 p/s
    0:00:01.041 4803.073967339097 p/s
    0:00:00.933 5359.056806002144 p/s
    0:00:01.135 4405.286343612334 p/s
    0:00:00.997 5015.045135406219 p/s
    0:00:01.010 4950.495049504951 p/s
    0:00:01.011 4945.598417408506 p/s
    0:00:01.054 4743.833017077799 p/s
    0:00:01.054 4743.833017077799 p/s
    0:00:01.053 4748.338081671415 p/s
    0:00:01.085 4608.294930875576 p/s
    0:00:01.109 4508.566275924256 p/s
    0:00:01.119 4468.275245755138 p/s
    0:00:01.113 4492.362982929021 p/s
    0:00:01.152 4340.277777777777 p/s
    0:00:01.166 4288.164665523156 p/s
    0:00:01.151 4344.0486533449175 p/s
    0:00:01.199 4170.141784820684 p/s
    0:00:01.496 3342.245989304813 p/s
    0:00:01.241 4029.0088638195007 p/s
    0:00:01.394 3586.800573888092 p/s
    0:00:01.264 3955.6962025316457 p/s
    0:00:01.277 3915.4267815191856 p/s
    0:00:01.424 3511.2359550561796 p/s
    0:00:01.370 3649.6350364963505 p/s
    0:00:01.313 3808.073115003808 p/s
    0:00:01.318 3793.626707132018 p/s
    0:00:01.331 3756.5740045078887 p/s
    0:00:01.357 3684.598378776713 p/s
    0:00:01.532 3263.7075718015667 p/s
    0:00:01.443 3465.003465003465 p/s
    0:00:01.618 3090.2348578491965 p/s
    0:00:01.429 3498.9503149055286 p/s
    0:00:01.721 2905.2876234747237 p/s
    0:00:01.450 3448.275862068965 p/s
    0:00:01.724 2900.232018561485 p/s
    0:00:01.486 3364.737550471063 p/s
    0:00:01.788 2796.420581655481 p/s
    0:00:01.540 3246.753246753247 p/s
    0:00:01.914 2612.3301985370954 p/s
    0:00:01.812 2759.381898454746 p/s
    0:00:01.879 2660.989888238425 p/s
    0:00:01.959 2552.322613578356 p/s
    0:00:02.182 2291.4757103574702 p/s
    0:00:01.649 3032.1406913280775 p/s
    0:00:01.619 3088.3261272390364 p/s
    0:00:01.638 3052.5030525030525 p/s
    0:00:01.625 3076.923076923077 p/s
    0:00:02.180 2293.5779816513764 p/s
    0:00:02.273 2199.7360316761988 p/s
    0:00:01.953 2560.1638504864313 p/s
    0:00:02.277 2195.8717610891526 p/s
    0:00:02.361 2117.746717492588 p/s
    0:00:02.375 2105.2631578947367 p/s
    0:00:01.790 2793.2960893854747 p/s
    0:00:02.048 2441.40625 p/s
    0:00:02.135 2341.9203747072597 p/s
    0:00:01.919 2605.5237102657634 p/s
    0:00:02.104 2376.425855513308 p/s
    0:00:02.146 2329.9161230195714 p/s
    0:00:02.416 2069.5364238410593 p/s
    Total time: 0:02:01.200   4125.412541254126 p/s

    What do you think about it?

     
  • Visitor
    Visitor
    2013-02-21

    As you can see from the code, I've also added this:

    document.close();
    document = PDDocument.createFromLocator(new FileLocator("test.pdf"));

    and it has increased performance quite a lot (about 70% faster).

     
  • mtraut
    mtraut
    2013-02-21

    WTF are you doing?

    You re-read the document from scratch and it's getting faster???? Skip this blunder…. What you see in the profile is the backtracking from reading the nested incremental xref sections in the PDF.

    And yes, I know that CSContent and COSWriter are using RandomAccessByteArray. The optimization is to pre-size the byte array they act upon.
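    Pre-sizing in that sense just means handing the writer a buffer that is already big enough, instead of letting a tiny default array resize its way up; roughly like this (the byte[] constructor is an assumption taken from the feedback quoted earlier):

    // start from a generously sized buffer instead of the small default
    byte[] buffer = new byte[100000];
    RandomAccessByteArray target = new RandomAccessByteArray(buffer);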

    What is the result when removing the "incrementalGarbageCollect" only?

     
  • Visitor
    Visitor
    2013-02-21

    Here it is,
    with just incrementalGarbageCollect removed:

    https://dl.dropbox.com/u/175137/jpod-profile/profile3.html

    By the way, the performance degradation is still present (code updated: https://gist.github.com/visik7/4961145 ).

    This is the output:

    0:00:01.792 5580.357142857143 p/s
    0:00:00.868 11520.73732718894 p/s
    0:00:00.781 12804.097311139565 p/s
    0:00:01.018 9823.18271119843 p/s
    0:00:00.666 15015.015015015015 p/s
    0:00:00.692 14450.86705202312 p/s
    0:00:00.800 12500.0 p/s
    0:00:01.035 9661.835748792271 p/s
    0:00:00.677 14771.048744460855 p/s
    0:00:00.883 11325.028312570781 p/s
    0:00:00.774 12919.896640826873 p/s
    0:00:01.268 7886.435331230284 p/s
    0:00:00.922 10845.986984815618 p/s
    0:00:00.796 12562.81407035176 p/s
    0:00:00.768 13020.833333333334 p/s
    0:00:00.991 10090.817356205851 p/s
    0:00:01.765 5665.7223796033995 p/s
    0:00:00.829 12062.726176115802 p/s
    0:00:00.962 10395.010395010395 p/s
    0:00:00.870 11494.252873563217 p/s
    0:00:00.932 10729.613733905579 p/s
    0:00:01.053 9496.67616334283 p/s
    0:00:00.992 10080.645161290322 p/s
    0:00:00.991 10090.817356205851 p/s
    0:00:02.207 4531.037607612143 p/s
    0:00:00.936 10683.760683760684 p/s
    0:00:00.933 10718.113612004288 p/s
    0:00:01.142 8756.567425569177 p/s
    0:00:01.031 9699.321047526673 p/s
    0:00:01.067 9372.071227741331 p/s
    0:00:01.138 8787.346221441125 p/s
    0:00:01.044 9578.544061302682 p/s
    0:00:01.017 9832.84169124877 p/s
    0:00:01.187 8424.599831508003 p/s
    0:00:02.608 3834.355828220859 p/s
    0:00:01.206 8291.873963515754 p/s
    0:00:01.251 7993.605115907275 p/s
    0:00:01.150 8695.652173913042 p/s
    0:00:01.059 9442.870632672331 p/s
    0:00:01.209 8271.298593879239 p/s
    0:00:01.098 9107.468123861567 p/s
    0:00:01.085 9216.589861751152 p/s
    0:00:03.192 3132.8320802005014 p/s
    0:00:01.315 7604.5627376425855 p/s
    0:00:01.256 7961.783439490446 p/s
    0:00:01.365 7326.007326007326 p/s
    0:00:01.136 8802.81690140845 p/s
    0:00:01.194 8375.209380234504 p/s
    0:00:01.333 7501.875468867217 p/s
    0:00:01.264 7911.392405063291 p/s
    0:00:01.532 6527.415143603133 p/s
    0:00:01.803 5546.311702717693 p/s
    0:00:04.068 2458.2104228121925 p/s
    0:00:01.253 7980.845969672785 p/s
    0:00:01.470 6802.7210884353735 p/s
    0:00:01.552 6443.298969072165 p/s
    0:00:08.545 1170.2750146284377 p/s
    0:00:13.771 726.1636772928618 p/s
    0:00:16.263 614.8927012236364 p/s
    0:00:19.201 520.8062080099994 p/s
    0:00:19.759 506.0984867655246 p/s
    0:00:22.574 442.98750775228143 p/s

    The profile and the output are synced.

     
  • Visitor
    Visitor
    2013-02-21

    By the way, I just used the code from de.intarsys.pdf.example.pd.doc.AppendPage
    to append a new PDF to an old one.
    Does it re-read everything?

     
  • mtraut
    mtraut
    2013-02-21

    I think this is nearly as good as it gets without detailed optimizations.

    The degradation, I bet, stems from increasing garbage collection activity. You can verify this in your profiler. A brute-force check may be to change CosIndirectObject.soften to:

    public void soften(COSObject pObject) {
        // is fixed?
        if ((flags & F_FIXED) != 0 && ((COSObject) object).mayBeSwapped()) {
            flags ^= F_FIXED;
            object = null;
        }
    }
    

    This will get rid of the weak references you don't want anyway.

    In the profile you see nearly only the internal association management (which again is needed for optimal interactive manipulation).

    If you reuse the font reference for example you may save another couple of seconds.
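    Reusing the font means creating the font object once, outside the page loop, and referencing that same instance from every page's content and resources; in sketch form (the exact factory call is an assumption, check de.intarsys.pdf.font):

    // build the font a single time ...
    PDFont font = PDFontType1.createNew(PDFontType1.FONT_Helvetica);
    for (int i = 0; i < pageCount; i++) {
        PDPage page = (PDPage) PDPage.META.createNew();
        // ... and reference the shared "font" here instead of creating
        //     a new font instance for every page ...
    }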

    "AppendPage" does certainly read documents, but not in a loop. What exactly do you mean?

     
  • Visitor
    Visitor
    2013-02-21

    Thanks,
    I'll do the tests tomorrow.
    For the AppendPage, never mind.

     