Stream (not in-memory) PDF manipulation

2010-10-05
2013-01-26
  • Chris Thielen
    2010-10-05

    Hi,

    I'm curious about the most recent entry on the blog.  Are you working on manipulation of documents without a fully-in-memory model, via streams?  If so, is the trunk repository available for testing?  I have a specific use case which would benefit greatly from stream processing. 

    My use case involves concatenating (and bookmarking) many PDF files (possibly thousands), then streaming the resulting PDF to a browser.  The in-memory model works, but it is obviously very memory-intensive.  In my simple testing, generating a 150 MB PDF requires approximately 320 MB of Java heap, which really isn't bad for an in-memory model, but it can make serving multiple concurrent requests problematic.  (A rough sketch of this setup appears after this post.)

    Thanks for your effort; I'm really impressed with PDFClown so far!

     
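A rough sketch of the setup described in the post above: merge the source PDFs with PDF Clown's in-memory model, then stream the result to the browser from a servlet (bookmark creation is omitted). Only the servlet plumbing here is standard; every PDF Clown call shown (File, Document, Page.clone, Pages.add, File.save, bytes.OutputStream) is an assumption about the library's API and may not match the real signatures, and resolveSourcePaths is a hypothetical application-specific helper.

    // Sketch only: the PDF Clown calls below are assumptions about its API.
    import java.io.IOException;
    import java.util.List;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.pdfclown.documents.Document;
    import org.pdfclown.documents.Page;
    import org.pdfclown.files.File;
    import org.pdfclown.files.SerializationModeEnum;

    public class MergedPdfServlet extends HttpServlet {
      @Override
      protected void doGet(HttpServletRequest request, HttpServletResponse response)
          throws ServletException, IOException {
        try {
          // Build the merged document fully in memory: this is the step whose heap
          // footprint grows with the number and size of the source files.
          File target = new File();
          Document targetDocument = target.getDocument();
          for (String path : resolveSourcePaths(request)) {
            File source = new File(path);
            for (Page page : source.getDocument().getPages()) {
              targetDocument.getPages().add((Page) page.clone(targetDocument));
            }
          }
          // Only this last step actually streams; by now the whole object model is
          // in memory, which is what a progressive serializer would avoid.
          response.setContentType("application/pdf");
          target.save(new org.pdfclown.bytes.OutputStream(response.getOutputStream()),
              SerializationModeEnum.Standard);
        } catch (Exception e) {
          throw new ServletException(e);
        }
      }

      private List<String> resolveSourcePaths(HttpServletRequest request) {
        // Hypothetical helper: maps the request to the PDF files to concatenate.
        throw new UnsupportedOperationException("application-specific");
      }
    }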
  • Hi,

    I'm perfectly aware of your concerns about the library's scalability in a server context: you're absolutely right. My primary design focus has been on offering a rich, flexible and consistent document object model; however, a pure in-memory representation has the obvious drawback of a large memory footprint, which isn't suitable for heavy concurrent use.

    To cope with this limitation, I'm considering a hybrid solution that would integrate the current DOM with a stream manager responsible for progressively serializing and disposing of indirect objects as soon as they are complete: users could choose, according to their requirements and coding strategies, to keep the whole model in memory or to stream it on both reading and writing. (A rough sketch of this idea follows this reply.)

    By the way, the blog entry you referred to doesn't mention streamed file serialization: it's about cross-reference streams and object streams, which are a PDF 1.5 structural optimization for better data compression. (The small demo after this reply illustrates the compression gain.)

    Thank you
    Stefano
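Not PDF Clown code, just a conceptual sketch of the hybrid approach described in the reply above: a hypothetical writer that serializes each indirect object to the output as soon as it is complete, keeps only its byte offset for the cross-reference table, and lets the in-memory object be discarded. Object bodies are treated as opaque strings here; a real implementation would serialize the DOM objects themselves.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.TreeMap;

    public final class ProgressiveSerializer {
      private final OutputStream out;
      // Object number -> byte offset: the only per-object state retained in memory.
      private final TreeMap<Integer, Long> offsets = new TreeMap<Integer, Long>();
      private long position;

      public ProgressiveSerializer(OutputStream out) throws IOException {
        this.out = out;
        write("%PDF-1.5\n");
      }

      // Called by the (hypothetical) stream manager when an indirect object is complete;
      // afterwards the in-memory representation of that object can be released.
      public void writeObject(int objectNumber, String body) throws IOException {
        offsets.put(objectNumber, position);
        write(objectNumber + " 0 obj\n" + body + "\nendobj\n");
      }

      // Emits a classic cross-reference table and trailer once all objects are written.
      public void finish(int rootObjectNumber) throws IOException {
        long xrefPosition = position;
        int size = offsets.isEmpty() ? 1 : offsets.lastKey() + 1;
        StringBuilder xref = new StringBuilder("xref\n0 " + size + "\n0000000000 65535 f \n");
        for (int number = 1; number < size; number++) {
          Long offset = offsets.get(number);
          xref.append(offset == null
              ? "0000000000 65535 f \n"
              : String.format("%010d 00000 n \n", offset));
        }
        write(xref.toString());
        write("trailer\n<< /Size " + size + " /Root " + rootObjectNumber + " 0 R >>\n"
            + "startxref\n" + xrefPosition + "\n%%EOF\n");
        out.flush();
      }

      private void write(String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        out.write(bytes);
        position += bytes.length;
      }
    }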

     
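A small, self-contained demo (plain JDK, nothing PDF Clown specific) of why packing many small objects into one PDF 1.5 object stream compresses better than handling them one by one: Flate gains a lot from seeing the repeated dictionary structure in a single buffer. The dictionary text is made up for illustration.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    public final class ObjectStreamCompressionDemo {
      public static void main(String[] args) throws IOException {
        int objectCount = 1000;
        long separate = 0;
        StringBuilder combined = new StringBuilder();
        for (int i = 1; i <= objectCount; i++) {
          // A typical small indirect object (illustrative page dictionary).
          String object = "<< /Type /Page /Parent 2 0 R /Contents " + (i + 10) + " 0 R"
              + " /MediaBox [0 0 612 792] >>\n";
          separate += deflate(object.getBytes(StandardCharsets.ISO_8859_1)).length;
          combined.append(object);
        }
        long packed = deflate(combined.toString().getBytes(StandardCharsets.ISO_8859_1)).length;
        System.out.println("Compressed one by one: " + separate + " bytes");
        System.out.println("Packed into one stream: " + packed + " bytes");
      }

      private static byte[] deflate(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out =
            new DeflaterOutputStream(buffer, new Deflater(Deflater.BEST_COMPRESSION))) {
          out.write(data);
        }
        return buffer.toByteArray();
      }
    }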
  • Chris, thank you for your contribution; I'll try it in the next few days.
    Stefano